Problem presentation

Companies often have to deal with attrition, i.e. employees' decision to voluntarily leave the workplace. It is easy to see why this phenomenon can be a problem for any company: employee turnover is expensive, since recruitment, hiring and training costs must be covered, and it increases the risk of disrupting workplace stability and productivity. Hence the need for a strategy to foresee attrition and, where possible, prevent it or mitigate its consequences.

This premise motivates our decision to analyze the “Employee Attrition and Factors” dataset, which concerns the reasons that may lead employees to leave the company they work for. We are dealing with a classification problem: our objective is to predict whether an employee ultimately leaves their job and to understand the motivations behind that choice.

Dataset presentation

Let us start our analysis by importing the selected dataset.

HR_Analytics <- read.csv("~/HR_Analytics.csv")
attach(HR_Analytics)
HR_Analytics <- HR_Analytics[, -c(4, 13)]  # drop two columns not listed in the feature table below

The chosen dataset was synthetically generated by IBM. It consists of 1470 instances, each corresponding to an employee of the company and described by 33 features, both numerical and categorical, representing the factors that may cause attrition. The column Attrition is a logical variable indicating whether an employee has quit.

| Name of the variable | Type | Description |
|---|---|---|
| Age | Discrete | The age of the employee |
| Attrition | Nominal | Whether or not the employee has left the organization |
| BusinessTravel | Ordinal | The frequency of business travel for the employee |
| Department | Nominal | The department the employee works in |
| DistanceFromHome | Discrete | The distance from home in miles for the employee |
| Education | Ordinal | The level of education achieved by the employee |
| EducationField | Nominal | The field of study for the employee’s education |
| EmployeeCount | Discrete | The total number of employees in the organization |
| EmployeeNumber | Discrete | A unique identifier for each employee profile |
| EnvironmentSatisfaction | Ordinal | The employee’s satisfaction with their work environment |
| Gender | Nominal | The gender of the employee |
| JobInvolvement | Ordinal | The level of involvement required for the employee’s job |
| JobLevel | Ordinal | The job level of the employee |
| JobRole | Nominal | The role of the employee in the organization |
| JobSatisfaction | Ordinal | The employee’s satisfaction with their job |
| MaritalStatus | Nominal | The marital status of the employee |
| MonthlyIncome | Continuous | The monthly income of the employee |
| MonthlyRate | Continuous | The monthly rate of pay for the employee |
| NumCompaniesWorked | Discrete | The number of companies the employee has worked for |
| Over18 | Nominal | Whether or not the employee is over 18 |
| OverTime | Nominal | Whether or not the employee works overtime |
| PercentSalaryHike | Discrete | The percentage of salary hike for the employee |
| PerformanceRating | Ordinal | The performance rating of the employee |
| RelationshipSatisfaction | Ordinal | The employee’s satisfaction with their relationships |
| StandardHours | Discrete | The standard hours of work for the employees |
| StockOptionLevel | Ordinal | The stock option level of the employee |
| TotalWorkingYears | Discrete | The total number of years the employee has worked |
| TrainingTimesLastYear | Discrete | The number of times the employee was taken for training in the last year |
| WorkLifeBalance | Ordinal | The employee’s perception of their work-life balance |
| YearsAtCompany | Discrete | The number of years the employee has been with the company |
| YearsInCurrentRole | Discrete | The number of years the employee has been in their current role |
| YearsSinceLastPromotion | Discrete | The number of years since the employee’s last promotion |
| YearsWithCurrManager | Discrete | The number of years the employee has been with their current manager |

Data preprocessing

Our data analysis begins with the search for null values and missing data:

sum(is.na(HR_Analytics))  # count missing values
## [1] 0
sum(duplicated(HR_Analytics))
## [1] 0

Taking a closer look at the features, we observed:

  • the presence of the column EmployeeNumber, which does not contribute anything substantial to the analysis, as it is merely a unique identifier for each employee;

  • the presence of the columns Over18, EmployeeCount and StandardHours, which each take a single value, making them uninformative. For example:

unique(Over18)
## [1] "Y"
  • an inconsistency in the column NumCompaniesWorked, which contains some 0 values. The example below shows why this is impossible: the employee has 8 total working years but only 7 at the current company, so they must have worked for at least 2 companies, yet NumCompaniesWorked says 0. We therefore conclude that there may have been a data collection error.
HR_Analytics[6,c("NumCompaniesWorked", "TotalWorkingYears", "YearsAtCompany")]
##   NumCompaniesWorked TotalWorkingYears YearsAtCompany
## 6                  0                 8              7

For these reasons, we decided to remove the columns EmployeeNumber, Over18, EmployeeCount and StandardHours from our dataset; the code below also drops MonthlyRate, which accordingly no longer appears in the summary that follows. Additionally, we removed every instance where the NumCompaniesWorked value was equal to 0.

HR_Analytics <- HR_Analytics[HR_Analytics$NumCompaniesWorked != 0, ]  # drop inconsistent rows
HR_Analytics <- HR_Analytics[, -c(8, 9, 18, 25, 20)]  # EmployeeCount, EmployeeNumber, MonthlyRate, StandardHours, Over18

Exploratory Data Analysis

We now want to understand how the features in the HR_Analytics dataset are distributed. First, we take a look at its summary.

summary(HR_Analytics)
##       Age         Attrition         BusinessTravel      Department       
##  Min.   :18.00   Length:1273        Length:1273        Length:1273       
##  1st Qu.:30.00   Class :character   Class :character   Class :character  
##  Median :36.00   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :37.34                                                           
##  3rd Qu.:44.00                                                           
##  Max.   :60.00                                                           
##  DistanceFromHome   Education     EducationField     EnvironmentSatisfaction
##  Min.   : 1.000   Min.   :1.000   Length:1273        Min.   :1.00           
##  1st Qu.: 2.000   1st Qu.:2.000   Class :character   1st Qu.:2.00           
##  Median : 7.000   Median :3.000   Mode  :character   Median :3.00           
##  Mean   : 9.297   Mean   :2.933                      Mean   :2.72           
##  3rd Qu.:14.000   3rd Qu.:4.000                      3rd Qu.:4.00           
##  Max.   :29.000   Max.   :5.000                      Max.   :4.00           
##     Gender          JobInvolvement     JobLevel      JobRole         
##  Length:1273        Min.   :1.000   Min.   :1.00   Length:1273       
##  Class :character   1st Qu.:2.000   1st Qu.:1.00   Class :character  
##  Mode  :character   Median :3.000   Median :2.00   Mode  :character  
##                     Mean   :2.727   Mean   :2.09                     
##                     3rd Qu.:3.000   3rd Qu.:3.00                     
##                     Max.   :4.000   Max.   :5.00                     
##  JobSatisfaction MaritalStatus      MonthlyIncome   NumCompaniesWorked
##  Min.   :1.000   Length:1273        Min.   : 1009   Min.   :1.00      
##  1st Qu.:2.000   Class :character   1st Qu.: 2911   1st Qu.:1.00      
##  Median :3.000   Mode  :character   Median : 5067   Median :2.00      
##  Mean   :2.723                      Mean   : 6611   Mean   :3.11      
##  3rd Qu.:4.000                      3rd Qu.: 8621   3rd Qu.:4.00      
##  Max.   :4.000                      Max.   :19973   Max.   :9.00      
##    OverTime         PercentSalaryHike PerformanceRating
##  Length:1273        Min.   :11.00     Min.   :3.000    
##  Class :character   1st Qu.:12.00     1st Qu.:3.000    
##  Mode  :character   Median :14.00     Median :3.000    
##                     Mean   :15.19     Mean   :3.156    
##                     3rd Qu.:18.00     3rd Qu.:3.000    
##                     Max.   :25.00     Max.   :4.000    
##  RelationshipSatisfaction StockOptionLevel TotalWorkingYears
##  Min.   :1.000            Min.   :0.0000   Min.   : 0.00    
##  1st Qu.:2.000            1st Qu.:0.0000   1st Qu.: 6.00    
##  Median :3.000            Median :1.0000   Median :10.00    
##  Mean   :2.727            Mean   :0.8013   Mean   :11.59    
##  3rd Qu.:4.000            3rd Qu.:1.0000   3rd Qu.:16.00    
##  Max.   :4.000            Max.   :3.0000   Max.   :40.00    
##  TrainingTimesLastYear WorkLifeBalance YearsAtCompany   YearsInCurrentRole
##  Min.   :0.000         Min.   :1.000   Min.   : 0.000   Min.   : 0.000    
##  1st Qu.:2.000         1st Qu.:2.000   1st Qu.: 2.000   1st Qu.: 2.000    
##  Median :3.000         Median :3.000   Median : 5.000   Median : 3.000    
##  Mean   :2.787         Mean   :2.768   Mean   : 6.815   Mean   : 4.147    
##  3rd Qu.:3.000         3rd Qu.:3.000   3rd Qu.:10.000   3rd Qu.: 7.000    
##  Max.   :6.000         Max.   :4.000   Max.   :40.000   Max.   :18.000    
##  YearsSinceLastPromotion YearsWithCurrManager
##  Min.   : 0.000          Min.   : 0.000      
##  1st Qu.: 0.000          1st Qu.: 2.000      
##  Median : 1.000          Median : 3.000      
##  Mean   : 2.166          Mean   : 4.019      
##  3rd Qu.: 2.000          3rd Qu.: 7.000      
##  Max.   :15.000          Max.   :17.000

Description of variables:

table(HR_Analytics$Attrition)
## 
##   No  Yes 
## 1059  214

The distribution of the target variable Attrition is unbalanced: approximately 83% of employees remain in the company while the remaining 17% leave.
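These percentages can be recomputed directly from the counts reported by table() above; a minimal check (using the printed counts rather than the full dataset):

```r
# Class balance of Attrition, using the counts printed above
attrition_counts <- c(No = 1059, Yes = 214)
round(prop.table(attrition_counts), 3)  # No 0.832, Yes 0.168
```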

Nominal categorical variables

Ordinal variables


Numerical (continuous and discrete) variables


We observe that the outliers in MonthlyIncome are entirely justifiable, as they correspond to higher-level positions (such as Research Director and Manager) with significant responsibilities, resulting in higher earnings.

Inspection of distributions concerning Attrition


We want to explore the distribution of all the features concerning Attrition to obtain a general overview and formulate meaningful questions to identify potential patterns.

Correlations


To better understand the relationships among the numerical variables in the dataset, we aim to visualize these relationships by creating a correlation matrix.
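As a sketch of this step: in the project the matrix would be computed over the numeric columns, e.g. `cor(HR_Analytics[sapply(HR_Analytics, is.numeric)])`; here we show the idea on a small illustrative frame (the values are made up, not dataset rows).

```r
# Correlation matrix on a toy frame with three numeric columns
toy <- data.frame(Age              = c(25, 32, 40, 51),
                  MonthlyIncome    = c(2000, 3500, 5200, 8000),
                  DistanceFromHome = c(10, 3, 8, 2))
round(cor(toy), 2)  # Age and MonthlyIncome are strongly positively correlated here
```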

Here Cramer’s V correlation among categorical (nominal and ordinal) variables with respect to the target variable Attrition.

##                       CRAMER'S V
## [1] "Attrition-BusinessTravel           0.1184"
## [1] "Attrition-Department               0.09307"
## [1] "Attrition-EducationField           0.1029"
## [1] "Attrition-Gender                   0.04597"
## [1] "Attrition-MaritalStatus            0.1802"
## [1] "Attrition-OverTime                 0.2475"
## [1] "Attrition-JobRole                  0.2623"
## [1] "Attrition-Education                0.06153"
## [1] "Attrition-JobLevel                 0.2326"
## [1] "Attrition-StockOptionLevel         0.2134"
## [1] "Attrition-PerformanceRating        0.008406"
## [1] "Attrition-JobInvolvement           0.145"
## [1] "Attrition-JobSatisfaction          0.09588"
## [1] "Attrition-EnvironmentSatisfaction  0.1504"
## [1] "Attrition-RelationshipSatisfaction 0.05398"

The variables most strongly correlated with Attrition are: OverTime, JobRole, JobLevel, StockOptionLevel
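The code behind the Cramér's V table is not shown in the report; a minimal base-R helper in the same spirit (the function name and the toy example are ours) could look like:

```r
# Cramér's V for two categorical vectors:
# V = sqrt(chi2 / (n * (min(rows, cols) - 1)))
cramers_v <- function(x, y) {
  tab  <- table(x, y)
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  unname(sqrt(chi2 / (sum(tab) * (min(dim(tab)) - 1))))
}

# Perfect association gives V = 1
cramers_v(c("yes", "yes", "no", "no"), c("high", "high", "low", "low"))  # 1
```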

Questions

  1. What are the characteristics most commonly exhibited by those who choose to leave?

    From the box plots, it’s evident that employees who leave the company tend to be younger than their counterparts who choose to stay. This aligns with recent studies indicating that the ‘Z’ and ‘Millennial’ generations are more inclined to change jobs frequently compared to previous generations. Additionally, we’ve noticed that among those who leave, female employees are generally younger than their male counterparts. This led us to wonder whether there’s an imbalanced gender distribution between those who leave and those who stay. However, based on the following plot, it’s clear that gender is not a defining factor in shaping the typical profile of those who decide to leave.

    We’re interested in examining how marital status is distributed among employees in the context of attrition. It’s worth highlighting that a substantial 50% of those who choose to leave the company are single employees.

    Let’s create a visualization that considers both age and other factors to gain a deeper insight into the typical profile of employees who decide to leave.

    Single employees, particularly those in the younger age groups (under 30), demonstrate a greater likelihood of leaving the company when compared to individuals with different marital statuses.

    Now, we want to visualize the distribution of BusinessTravel among the two Attrition classes.

    It appears that there is a higher presence of employees who travel frequently in the Attrition=Yes class compared to the opposite class (30% vs 15%).

    We can observe that among the groups represented, the highest percentage of attrition is among single employees who travel frequently, followed by single employees who travel rarely. Divorced employees who travel frequently also show a notable percentage of attrition.

  2. How does employee well-being impact the decision to leave?

    In the variables Job Satisfaction, Environment Satisfaction, Relationship Satisfaction, Work-Life Balance, and Job Involvement, we find indicators of employee well-being, contentment, and comfort within the company. Given this, and assuming that these variables may play a role in the decision to leave, we aim to investigate whether there is a significant relationship between each of these variables and the “Attrition” variable.

Before exploring this analysis, we want to gain an initial overview of overall employee satisfaction among those who choose to stay compared to those who decide to leave. To do this, we calculate the averages of the HR_Analytics Satisfaction variables (Job, Environment, Relationship) and examine their distributions.

Both boxplots share an identical mean; however, there is a notable contrast in the spread of the data, evident from the second quartile (median) and the third quartile (75th percentile). This indicates that individuals who leave tend to have an overall lower satisfaction level.
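The overall-satisfaction score described above is simply a row-wise mean of the three ordinal satisfaction columns. A sketch of that computation (the column names follow the dataset; the two-row toy frame is only for illustration, in the project the same lines would run on HR_Analytics):

```r
sat_cols <- c("JobSatisfaction", "EnvironmentSatisfaction", "RelationshipSatisfaction")
toy <- data.frame(JobSatisfaction          = c(4, 1),
                  EnvironmentSatisfaction  = c(3, 2),
                  RelationshipSatisfaction = c(2, 3))
toy$MeanSatisfaction <- rowMeans(toy[, sat_cols])  # average of the three scores per row
toy$MeanSatisfaction  # 3 2
```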

To evaluate the relationships and dependencies among the mentioned features (*Job Satisfaction*, *Environment Satisfaction*, *Relationship Satisfaction*, *Work-Life Balance*, and *Job Involvement*), we conducted a statistical analysis using the chi-square test. The chi-square test is a robust tool for examining associations between categorical variables and helps us determine whether there is a significant statistical relationship between these variables and Attrition. Here, we present the p-value results.


```
## [1] "Attrition-JobSatisfaction           p-value < 0.0084727510996434"
## [1] "Attrition-EnvironmentSatisfaction   p-value < 0.00000248436040568772"
## [1] "Attrition-RelationshipSatisfaction  p-value < 0.294647700977509"
## [1] "Attrition-WorkLifeBalance           p-value < 0.000385782940897055"
## [1] "Attrition-JobInvolvement            p-value < 0.00000656457066653085"
```

So, for all the considered variables except *RelationshipSatisfaction*, the p-value is very small (well below 0.05), giving us evidence to reject the null hypothesis of independence. We can say that there is a statistically significant association between *Attrition* and *Job Satisfaction*, *Environment Satisfaction*, *Work-Life Balance*, and *Job Involvement*.
In light of these results, we now want to examine how the overall view of satisfaction changes when considering the average of *JobSatisfaction* and *EnvironmentSatisfaction* alone, expecting an even clearer and more distinct interpretation than before.
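The p-values above come from pairwise chi-square tests of each variable against Attrition; a loop in that spirit can be sketched as follows (the helper name and the toy data are ours, not the project's original code):

```r
# Chi-square p-value of each categorical variable against a target column
chisq_pvalues <- function(df, target, vars) {
  sapply(vars, function(v)
    suppressWarnings(chisq.test(table(df[[target]], df[[v]]))$p.value))
}

# Toy illustration with one strongly associated variable
set.seed(1)
toy <- data.frame(Attrition = rep(c("No", "Yes"), each = 50))
toy$JobSatisfaction <- ifelse(toy$Attrition == "Yes",
                              sample(1:2, 100, replace = TRUE),
                              sample(3:4, 100, replace = TRUE))
chisq_pvalues(toy, "Attrition", "JobSatisfaction") < 0.05  # TRUE: strong association
```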

  3. Does the decision to work overtime have an impact on the choice to leave the company?

    As we’ve observed from the overall view of Attrition among each variable, it’s evident that employees who work overtime exhibit a significantly higher Attrition rate, which stands at 28%. Furthermore, we can see that more than 50% of employees who leave are those who work overtime.

    Now, we want to conduct a chi-square test to analyze the relationship between these two variables.

    ## 
    ##  Pearson's Chi-squared test with Yates' continuity correction
    ## 
    ## data:  table(HR_Analytics$Attrition, HR_Analytics$OverTime)
    ## X-squared = 76.501, df = 1, p-value < 0.00000000000000022

    Due to the very small p-value, we can reject the null hypothesis and statistically confirm that *OverTime* and *Attrition* are strongly associated.

  4. How does Monthly Income impact the choice to leave?

    Considering the previous plots that depict the distribution of Monthly Income in relation to Attrition, it becomes apparent that individuals who decide to leave typically earn less.

    We intend to confirm this by conducting a Wilcoxon rank sum test to compare the distribution of Monthly Income (a numerical variable) between the two distinct groups defined by Attrition.

    ## 
    ##  Wilcoxon rank sum test with continuity correction
    ## 
    ## data:  MonthlyIncome by Attrition
    ## W = 149609, p-value = 0.0000000000001368
    ## alternative hypothesis: true location shift is not equal to 0

    Due to the small p-value we have obtained, we can reject the null hypothesis: there is strong evidence of a significant difference in monthly income between employees who left and those who did not.
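The output above corresponds to a call of the form `wilcox.test(MonthlyIncome ~ Attrition, data = HR_Analytics)`; the exact code is not shown in the report. A self-contained toy illustration of the same formula interface (the values are made up):

```r
# Two clearly separated income groups; the rank sum test should reject equality
toy <- data.frame(income = c(1000, 1200, 1100, 1050, 5000, 5200, 5100, 5050),
                  left   = rep(c("Yes", "No"), each = 4))
wilcox.test(income ~ left, data = toy)$p.value < 0.05  # TRUE: distributions differ
```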

  5. Which roles are most critical?

    Having examined potential factors that could influence the decision to leave, our next step is to explore these factors across the different Job Roles to identify areas where intervention may be necessary.

    To begin, we’ll illustrate how Attrition is distributed among various job roles.

    It seems that the most critical roles are Sales Representative (with approximately 40% of Attrition), Human Resources and Laboratory Technician.

    Now, we want to visualize whether some of these roles are more significantly affected by the “younger employee attrition phenomenon.”

    We observe that younger employees who choose to leave are predominantly found among the roles of Sales Representative, Human Resources, Laboratory Technician.

    We can notice that the most critical salary-related issues manifest in the following roles: Sales Representative, Research Scientist, Laboratory Technician, and Human Resources. In contrast, an opposite trend is evident in the roles of Research Director and Healthcare Representative.

    We want to determine if there are roles in which the level of satisfaction, as measured by the mean of Job Satisfaction and Environment Satisfaction, is lower compared to others.

    With the exception of the Research Director, it appears that those who leave the company tend to have lower overall satisfaction. Notably, this trend is particularly pronounced in roles such as Human Resources, Healthcare Representative, and Sales Representative.

    The following plot illustrates that employees who work overtime and decide to leave are predominantly found among Sales Representatives, Laboratory Technicians, and Human Resources.

Classification

Our intent is to predict whether or not an employee leaves the workplace. For this purpose, we decided to use several classification models.

Note that categorical variables must be encoded so that they can be treated as numeric and hence be used in the models: we did so by converting the levels of such columns to numeric codes.
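A sketch of that encoding step (the helper name and the toy frame are ours; the project's exact code is not shown). The 0-based codes match the 0/1 labels of Attrition that appear in the class table further below:

```r
# Replace each character column with the 0-based integer codes of its factor levels
encode_categoricals <- function(df) {
  is_chr <- vapply(df, is.character, logical(1))
  df[is_chr] <- lapply(df[is_chr], function(x) as.numeric(as.factor(x)) - 1)
  df
}

toy <- data.frame(Attrition = c("No", "Yes", "No"),
                  OverTime  = c("Yes", "Yes", "No"),
                  Age       = c(30, 41, 25))
encode_categoricals(toy)$Attrition  # 0 1 0 (levels sorted alphabetically)
```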

Let us define a function that, given a confusion matrix, quickly computes the metrics accuracy, precision, recall, specificity and F1 score.

compute_metrics <- function(confusion_matrix) {
  # Rows are predicted classes (0/1), columns are actual classes (0/1),
  # as produced by table(predicted, actual)
  tp <- confusion_matrix[2, 2]  # predicted 1, actual 1
  tn <- confusion_matrix[1, 1]  # predicted 0, actual 0
  fp <- confusion_matrix[2, 1]  # predicted 1, actual 0
  fn <- confusion_matrix[1, 2]  # predicted 0, actual 1
  accuracy <- (tp + tn) / sum(confusion_matrix)
  precision <- tp / (tp + fp)
  recall <- tp / (tp + fn)
  specificity <- tn / (tn + fp)
  f1_score <- 2 * (precision * recall) / (precision + recall)
  metrics <- data.frame(metric = c("Accuracy", "Precision", "Recall", "Specificity", "F1 score"),
                        values = c(accuracy, precision, recall, specificity, f1_score))
  print(t(metrics))
}

We reckon that our priority is to detect people who intend to leave the workplace; therefore, the metric we aim to maximize is recall. Indeed, misclassifying an employee who does not actually plan to leave is less costly than missing someone who intends to quit. Moreover, since we are dealing with an unbalanced dataset, we prioritize the F1 score over accuracy, as the former by definition combines recall and precision, thus taking the prediction performance on both classes into account.

Lastly, we split our dataset into a training set and a test set with a 75/25 ratio, setting a seed for reproducibility purposes.

set.seed(108)

n <- nrow(HR_Analytics)
train <- sample(1:n, round(n*0.75))
test <- setdiff(1:n, train)

train.data <- HR_Analytics[train, ]
test.data <- HR_Analytics[test, ]
table(test.data$Attrition)
## 
##   0   1 
## 269  49

Logistic Regression with Unbalanced Dataset

The first model used is Logistic Regression.

full.model <- glm(train.data$Attrition ~ ., data = train.data, family = binomial)
summary(full.model)
## 
## Call:
## glm(formula = train.data$Attrition ~ ., family = binomial, data = train.data)
## 
## Coefficients:
##                              Estimate   Std. Error z value             Pr(>|z|)
## (Intercept)               5.639564717  1.479661601   3.811             0.000138
## Age                      -0.026071195  0.016149084  -1.614             0.106439
## BusinessTravel            0.977684544  0.213482294   4.580           0.00000466
## Department               -0.804382863  0.225850155  -3.562             0.000369
## DistanceFromHome          0.038723594  0.013335771   2.904             0.003687
## Education                -0.114273331  0.106080303  -1.077             0.281376
## EducationField            0.146811797  0.076331278   1.923             0.054436
## EnvironmentSatisfaction  -0.436455313  0.101031694  -4.320           0.00001560
## Gender                   -0.583007162  0.228740128  -2.549             0.010810
## JobInvolvement           -0.521785893  0.147591426  -3.535             0.000407
## JobLevel                 -0.581758995  0.346956885  -1.677             0.093592
## JobRole                   0.078327691  0.053558574   1.462             0.143613
## JobSatisfaction          -0.368926405  0.101850617  -3.622             0.000292
## MaritalStatus            -0.550908340  0.199859765  -2.756             0.005843
## MonthlyIncome            -0.000001702  0.000080807  -0.021             0.983191
## NumCompaniesWorked        0.189292061  0.048271537   3.921           0.00008804
## OverTime                  2.094202844  0.233260253   8.978 < 0.0000000000000002
## PercentSalaryHike        -0.064494040  0.047736316  -1.351             0.176680
## PerformanceRating         0.209080116  0.489693979   0.427             0.669408
## RelationshipSatisfaction -0.175882249  0.102533741  -1.715             0.086279
## StockOptionLevel         -0.214702940  0.174292000  -1.232             0.218002
## TotalWorkingYears        -0.066036861  0.033291112  -1.984             0.047298
## TrainingTimesLastYear    -0.158876646  0.091248738  -1.741             0.081659
## WorkLifeBalance          -0.541483996  0.148526190  -3.646             0.000267
## YearsAtCompany            0.128179768  0.043174647   2.969             0.002989
## YearsInCurrentRole       -0.149498746  0.055918660  -2.674             0.007506
## YearsSinceLastPromotion   0.149768408  0.050488996   2.966             0.003014
## YearsWithCurrManager     -0.151385099  0.057652148  -2.626             0.008644
##                             
## (Intercept)              ***
## Age                         
## BusinessTravel           ***
## Department               ***
## DistanceFromHome         ** 
## Education                   
## EducationField           .  
## EnvironmentSatisfaction  ***
## Gender                   *  
## JobInvolvement           ***
## JobLevel                 .  
## JobRole                     
## JobSatisfaction          ***
## MaritalStatus            ** 
## MonthlyIncome               
## NumCompaniesWorked       ***
## OverTime                 ***
## PercentSalaryHike           
## PerformanceRating           
## RelationshipSatisfaction .  
## StockOptionLevel            
## TotalWorkingYears        *  
## TrainingTimesLastYear    .  
## WorkLifeBalance          ***
## YearsAtCompany           ** 
## YearsInCurrentRole       ** 
## YearsSinceLastPromotion  ** 
## YearsWithCurrManager     ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 879.09  on 954  degrees of freedom
## Residual deviance: 576.27  on 927  degrees of freedom
## AIC: 632.27
## 
## Number of Fisher Scoring iterations: 6

We can test different thresholds, bearing in mind that, in theory, a smaller threshold should improve recall. We try out 0.4, 0.5 and 0.6.

predicted <- predict(full.model, newdata = test.data, type = "response")

for (t in c(0.4, 0.5, 0.6)) {
  predicted.classes <- ifelse(predicted > t, 1, 0)
  confusion.matrix <- table(predicted.classes, test.data$Attrition)
  print(confusion.matrix)
  compute_metrics(confusion.matrix)
}
##                  
## predicted.classes   0   1
##                 0 243  27
##                 1  26  22
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8333333" "0.4583333" "0.4489796" "0.9033457"   "0.4536082"
##                  
## predicted.classes   0   1
##                 0 258  31
##                 1  11  18
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8679245" "0.6206897" "0.3673469" "0.9591078"   "0.4615385"
##                  
## predicted.classes   0   1
##                 0 264  33
##                 1   5  16
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8805031" "0.7619048" "0.3265306" "0.9814126"   "0.4571429"

As we could have foreseen, a threshold of 0.4 best fits our purpose: the model then more readily classifies instances as 1. Nevertheless, we believe these results can be further improved.

To assess collinearity in the model, we compute the Variance Inflation Factor (VIF), deeming any variable with a VIF over 5 problematic. A high VIF indicates that a variable is strongly correlated with the other independent variables in the model, which can lead to unstable coefficient estimates and decreased interpretability.

vif(full.model)  # vif() is provided by the car package
##                      Age           BusinessTravel               Department 
##                 1.981978                 1.114924                 1.286799 
##         DistanceFromHome                Education           EducationField 
##                 1.084124                 1.084458                 1.061957 
##  EnvironmentSatisfaction                   Gender           JobInvolvement 
##                 1.089613                 1.061261                 1.059996 
##                 JobLevel                  JobRole          JobSatisfaction 
##                 8.943373                 1.287902                 1.083157 
##            MaritalStatus            MonthlyIncome       NumCompaniesWorked 
##                 1.857679                 8.121151                 1.369443 
##                 OverTime        PercentSalaryHike        PerformanceRating 
##                 1.202241                 2.562508                 2.552855 
## RelationshipSatisfaction         StockOptionLevel        TotalWorkingYears 
##                 1.108647                 1.822890                 4.477195 
##    TrainingTimesLastYear          WorkLifeBalance           YearsAtCompany 
##                 1.061533                 1.106284                 5.368819 
##       YearsInCurrentRole  YearsSinceLastPromotion     YearsWithCurrManager 
##                 2.756052                 2.327484                 2.749695

There are only a few troublesome collinear variables with VIF values over 5. We proceeded to eliminate them one at a time, refitting after each removal, ultimately producing the following model, in which the variables JobLevel and YearsAtCompany are gone.
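For each predictor j, the VIF used above is defined as 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A base-R sketch of that definition (the helper name and the toy data are ours), showing how a nearly collinear column is flagged:

```r
# VIF of each column of a numeric matrix, computed from the definition
manual_vif <- function(X) {
  sapply(seq_len(ncol(X)), function(j) {
    others <- as.data.frame(X[, -j, drop = FALSE])
    r2 <- summary(lm(X[, j] ~ ., data = others))$r.squared
    1 / (1 - r2)
  })
}

set.seed(1)
x1 <- rnorm(100)
x2 <- rnorm(100)
x3 <- x1 + x2 + rnorm(100, sd = 0.1)  # almost a linear combination of x1 and x2
round(manual_vif(cbind(x1, x2, x3)), 1)  # all three far exceed the threshold of 5
```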

full.model.1 <- glm(train.data$Attrition ~ . - JobLevel - YearsAtCompany, data = train.data, family = binomial)
summary(full.model.1)
## 
## Call:
## glm(formula = train.data$Attrition ~ . - JobLevel - YearsAtCompany, 
##     family = binomial, data = train.data)
## 
## Coefficients:
##                            Estimate Std. Error z value             Pr(>|z|)    
## (Intercept)               5.3738502  1.4509513   3.704             0.000212 ***
## Age                      -0.0287792  0.0159886  -1.800             0.071863 .  
## BusinessTravel            0.9459034  0.2107574   4.488           0.00000719 ***
## Department               -0.6801298  0.2204704  -3.085             0.002036 ** 
## DistanceFromHome          0.0354978  0.0130315   2.724             0.006449 ** 
## Education                -0.1245191  0.1053929  -1.181             0.237414    
## EducationField            0.1379003  0.0753363   1.830             0.067181 .  
## EnvironmentSatisfaction  -0.4361269  0.1000806  -4.358           0.00001314 ***
## Gender                   -0.5756776  0.2262071  -2.545             0.010930 *  
## JobInvolvement           -0.5402624  0.1459430  -3.702             0.000214 ***
## JobRole                   0.0811444  0.0531546   1.527             0.126867    
## JobSatisfaction          -0.3251737  0.0990701  -3.282             0.001030 ** 
## MaritalStatus            -0.5289013  0.1976041  -2.677             0.007438 ** 
## MonthlyIncome            -0.0001017  0.0000431  -2.360             0.018261 *  
## NumCompaniesWorked        0.1628835  0.0472813   3.445             0.000571 ***
## OverTime                  2.0655622  0.2304888   8.962 < 0.0000000000000002 ***
## PercentSalaryHike        -0.0609106  0.0478049  -1.274             0.202611    
## PerformanceRating         0.1682054  0.4885533   0.344             0.730626    
## RelationshipSatisfaction -0.1716181  0.1019241  -1.684             0.092223 .  
## StockOptionLevel         -0.2185963  0.1736300  -1.259             0.208038    
## TotalWorkingYears        -0.0510323  0.0300978  -1.696             0.089971 .  
## TrainingTimesLastYear    -0.1519060  0.0897364  -1.693             0.090493 .  
## WorkLifeBalance          -0.5275943  0.1467449  -3.595             0.000324 ***
## YearsInCurrentRole       -0.0932801  0.0531587  -1.755             0.079302 .  
## YearsSinceLastPromotion   0.2010928  0.0481103   4.180           0.00002917 ***
## YearsWithCurrManager     -0.0884568  0.0559007  -1.582             0.113560    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 879.09  on 954  degrees of freedom
## Residual deviance: 587.00  on 929  degrees of freedom
## AIC: 639
## 
## Number of Fisher Scoring iterations: 6

We can see from the summary that this model still contains several non-significant variables (with high p-values), but we can improve it further using a feature-selection approach: we tested both the forward and the backward algorithm, using the function step.

In the forward selection algorithm, we start from the intercept-only model and add one variable at a time, at each step choosing the variable that most improves the criterion. Conversely, the backward algorithm starts from the complete model and iteratively removes one variable at a time as long as the AIC keeps decreasing. AIC stands for Akaike Information Criterion, defined as AIC = 2k - 2 log(L), where k is the number of estimated parameters and L the maximized likelihood: it quantifies the trade-off between goodness of fit and model complexity.

We have observed that backward selection yields a better-performing model with a lower AIC, so that is the algorithm we are going to keep using for the entire project.
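Both directions can be run with step; a minimal sketch of the comparison, assuming the full model full.model.1 and the training set train.data from the previous chunks (the formula style here is illustrative):

```r
# Backward elimination: start from the full model
backward.model <- step(full.model.1, direction = "backward")

# Forward selection: start from the intercept-only model and search
# within the scope of the full model's formula
null.model <- glm(Attrition ~ 1, data = train.data, family = binomial)
forward.model <- step(null.model,
                      scope = formula(full.model.1),
                      direction = "forward")

# The two AICs can then be compared directly
c(forward = AIC(forward.model), backward = AIC(backward.model))
```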

full.model.step <- step(full.model.1, direction = "backward")
## Start:  AIC=639
## train.data$Attrition ~ (Age + BusinessTravel + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     JobInvolvement + JobLevel + JobRole + JobSatisfaction + MaritalStatus + 
##     MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + 
##     PerformanceRating + RelationshipSatisfaction + StockOptionLevel + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsAtCompany + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager) - JobLevel - YearsAtCompany
## 
##                            Df Deviance    AIC
## - PerformanceRating         1   587.12 637.12
## - Education                 1   588.40 638.40
## - StockOptionLevel          1   588.62 638.62
## - PercentSalaryHike         1   588.65 638.65
## <none>                          587.00 639.00
## - JobRole                   1   589.31 639.31
## - YearsWithCurrManager      1   589.52 639.52
## - RelationshipSatisfaction  1   589.85 639.85
## - TrainingTimesLastYear     1   589.93 639.93
## - TotalWorkingYears         1   589.96 639.96
## - YearsInCurrentRole        1   590.14 640.14
## - Age                       1   590.33 640.33
## - EducationField            1   590.35 640.35
## - MonthlyIncome             1   592.79 642.79
## - Gender                    1   593.70 643.70
## - DistanceFromHome          1   594.32 644.32
## - MaritalStatus             1   594.44 644.44
## - Department                1   596.53 646.53
## - JobSatisfaction           1   597.97 647.97
## - NumCompaniesWorked        1   598.71 648.71
## - WorkLifeBalance           1   600.10 650.10
## - JobInvolvement            1   600.97 650.97
## - YearsSinceLastPromotion   1   604.55 654.55
## - EnvironmentSatisfaction   1   606.64 656.64
## - BusinessTravel            1   608.14 658.14
## - OverTime                  1   677.64 727.64
## 
## Step:  AIC=637.12
## train.data$Attrition ~ Age + BusinessTravel + Department + DistanceFromHome + 
##     Education + EducationField + EnvironmentSatisfaction + Gender + 
##     JobInvolvement + JobRole + JobSatisfaction + MaritalStatus + 
##     MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + 
##     RelationshipSatisfaction + StockOptionLevel + TotalWorkingYears + 
##     TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + 
##     YearsSinceLastPromotion + YearsWithCurrManager
## 
##                            Df Deviance    AIC
## - Education                 1   588.52 636.52
## - StockOptionLevel          1   588.74 636.74
## <none>                          587.12 637.12
## - JobRole                   1   589.39 637.39
## - YearsWithCurrManager      1   589.67 637.67
## - PercentSalaryHike         1   589.71 637.71
## - RelationshipSatisfaction  1   589.98 637.98
## - TotalWorkingYears         1   590.04 638.04
## - TrainingTimesLastYear     1   590.10 638.10
## - YearsInCurrentRole        1   590.20 638.20
## - Age                       1   590.45 638.45
## - EducationField            1   590.47 638.47
## - MonthlyIncome             1   592.93 640.93
## - Gender                    1   593.76 641.76
## - DistanceFromHome          1   594.35 642.35
## - MaritalStatus             1   594.58 642.58
## - Department                1   596.56 644.56
## - JobSatisfaction           1   598.01 646.01
## - NumCompaniesWorked        1   598.71 646.71
## - WorkLifeBalance           1   600.36 648.36
## - JobInvolvement            1   601.13 649.13
## - YearsSinceLastPromotion   1   604.73 652.73
## - EnvironmentSatisfaction   1   606.69 654.69
## - BusinessTravel            1   608.39 656.39
## - OverTime                  1   678.05 726.05
## 
## Step:  AIC=636.52
## train.data$Attrition ~ Age + BusinessTravel + Department + DistanceFromHome + 
##     EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + 
##     JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + 
##     NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction + 
##     StockOptionLevel + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager
## 
##                            Df Deviance    AIC
## - StockOptionLevel          1   590.36 636.36
## <none>                          588.52 636.52
## - JobRole                   1   590.98 636.98
## - PercentSalaryHike         1   591.02 637.02
## - YearsWithCurrManager      1   591.15 637.15
## - RelationshipSatisfaction  1   591.44 637.44
## - TotalWorkingYears         1   591.47 637.47
## - TrainingTimesLastYear     1   591.47 637.47
## - YearsInCurrentRole        1   591.52 637.52
## - EducationField            1   591.89 637.89
## - Age                       1   592.44 638.44
## - MonthlyIncome             1   594.36 640.36
## - DistanceFromHome          1   595.48 641.48
## - Gender                    1   595.54 641.54
## - MaritalStatus             1   595.76 641.76
## - Department                1   598.30 644.30
## - JobSatisfaction           1   599.05 645.05
## - NumCompaniesWorked        1   599.87 645.87
## - WorkLifeBalance           1   601.52 647.52
## - JobInvolvement            1   602.88 648.88
## - YearsSinceLastPromotion   1   605.64 651.64
## - EnvironmentSatisfaction   1   607.28 653.28
## - BusinessTravel            1   609.38 655.38
## - OverTime                  1   679.48 725.48
## 
## Step:  AIC=636.36
## train.data$Attrition ~ Age + BusinessTravel + Department + DistanceFromHome + 
##     EducationField + EnvironmentSatisfaction + Gender + JobInvolvement + 
##     JobRole + JobSatisfaction + MaritalStatus + MonthlyIncome + 
##     NumCompaniesWorked + OverTime + PercentSalaryHike + RelationshipSatisfaction + 
##     TotalWorkingYears + TrainingTimesLastYear + WorkLifeBalance + 
##     YearsInCurrentRole + YearsSinceLastPromotion + YearsWithCurrManager
## 
##                            Df Deviance    AIC
## <none>                          590.36 636.36
## - JobRole                   1   592.87 636.87
## - RelationshipSatisfaction  1   592.92 636.92
## - PercentSalaryHike         1   592.97 636.97
## - YearsWithCurrManager      1   593.04 637.04
## - TotalWorkingYears         1   593.14 637.14
## - TrainingTimesLastYear     1   593.40 637.40
## - YearsInCurrentRole        1   593.52 637.52
## - EducationField            1   593.62 637.62
## - Age                       1   594.40 638.40
## - MonthlyIncome             1   596.20 640.20
## - Gender                    1   596.91 640.91
## - DistanceFromHome          1   596.99 640.99
## - Department                1   600.10 644.10
## - NumCompaniesWorked        1   601.30 645.30
## - JobSatisfaction           1   601.34 645.34
## - WorkLifeBalance           1   604.32 648.32
## - JobInvolvement            1   604.96 648.96
## - YearsSinceLastPromotion   1   607.52 651.52
## - EnvironmentSatisfaction   1   608.70 652.70
## - BusinessTravel            1   611.31 655.31
## - MaritalStatus             1   612.45 656.45
## - OverTime                  1   681.03 725.03
summary(full.model.step)
## 
## Call:
## glm(formula = train.data$Attrition ~ Age + BusinessTravel + Department + 
##     DistanceFromHome + EducationField + EnvironmentSatisfaction + 
##     Gender + JobInvolvement + JobRole + JobSatisfaction + MaritalStatus + 
##     MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + 
##     RelationshipSatisfaction + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion + 
##     YearsWithCurrManager, family = binomial, data = train.data)
## 
## Coefficients:
##                             Estimate  Std. Error z value             Pr(>|z|)
## (Intercept)               5.47779689  1.03771114   5.279           0.00000013
## Age                      -0.03141510  0.01587514  -1.979             0.047829
## BusinessTravel            0.93356313  0.20889928   4.469           0.00000786
## Department               -0.68574034  0.21995900  -3.118             0.001823
## DistanceFromHome          0.03344059  0.01289453   2.593             0.009503
## EducationField            0.13571811  0.07513388   1.806             0.070864
## EnvironmentSatisfaction  -0.41774453  0.09907358  -4.217           0.00002481
## Gender                   -0.56585144  0.22478896  -2.517             0.011827
## JobInvolvement           -0.54840057  0.14499312  -3.782             0.000155
## JobRole                   0.08456021  0.05316170   1.591             0.111694
## JobSatisfaction          -0.32316086  0.09841699  -3.284             0.001025
## MaritalStatus            -0.69416248  0.15259129  -4.549           0.00000539
## MonthlyIncome            -0.00010191  0.00004299  -2.371             0.017757
## NumCompaniesWorked        0.15569052  0.04675489   3.330             0.000869
## OverTime                  2.05679276  0.22919926   8.974 < 0.0000000000000002
## PercentSalaryHike        -0.04826242  0.03028913  -1.593             0.111073
## RelationshipSatisfaction -0.16127034  0.10096136  -1.597             0.110188
## TotalWorkingYears        -0.04941084  0.03003061  -1.645             0.099898
## TrainingTimesLastYear    -0.15494378  0.08986337  -1.724             0.084669
## WorkLifeBalance          -0.54247129  0.14618108  -3.711             0.000206
## YearsInCurrentRole       -0.09313771  0.05284844  -1.762             0.078009
## YearsSinceLastPromotion   0.19775349  0.04783774   4.134           0.00003568
## YearsWithCurrManager     -0.09085149  0.05570928  -1.631             0.102930
##                             
## (Intercept)              ***
## Age                      *  
## BusinessTravel           ***
## Department               ** 
## DistanceFromHome         ** 
## EducationField           .  
## EnvironmentSatisfaction  ***
## Gender                   *  
## JobInvolvement           ***
## JobRole                     
## JobSatisfaction          ** 
## MaritalStatus            ***
## MonthlyIncome            *  
## NumCompaniesWorked       ***
## OverTime                 ***
## PercentSalaryHike           
## RelationshipSatisfaction    
## TotalWorkingYears        .  
## TrainingTimesLastYear    .  
## WorkLifeBalance          ***
## YearsInCurrentRole       .  
## YearsSinceLastPromotion  ***
## YearsWithCurrManager        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 879.09  on 954  degrees of freedom
## Residual deviance: 590.36  on 932  degrees of freedom
## AIC: 636.36
## 
## Number of Fisher Scoring iterations: 6

The function step has produced a model that excludes the variables Education, PerformanceRating and StockOptionLevel; the most significant features ('***') are:

  • BusinessTravel
  • EnvironmentSatisfaction
  • JobInvolvement
  • MaritalStatus
  • NumCompaniesWorked
  • OverTime
  • WorkLifeBalance
  • YearsSinceLastPromotion

In particular, OverTime has a markedly smaller p-value (< 2e-16) than any other predictor variable.

From now on, for every model we will still try out different thresholds (0.4, 0.5 and 0.6), but for brevity we report only the best of the three. Below are the confusion matrix and the corresponding metrics obtained by the model with a threshold of 0.4:

##                       
## predicted.classes.step   0   1
##                      0 248  28
##                      1  21  21
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8459119" "0.5000000" "0.4285714" "0.9219331"   "0.4615385"
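The printed values follow from the standard confusion-matrix formulas. The project's actual compute_metrics helper is defined in an earlier chunk; a sketch of what it computes (compute_metrics_sketch is an illustrative name; rows of the matrix are predicted classes, columns are actual classes):

```r
compute_metrics_sketch <- function(cm) {
  tn <- cm[1, 1]; fn <- cm[1, 2]   # true negatives, false negatives
  fp <- cm[2, 1]; tp <- cm[2, 2]   # false positives, true positives
  accuracy    <- (tp + tn) / sum(cm)
  precision   <- tp / (tp + fp)
  recall      <- tp / (tp + fn)    # sensitivity
  specificity <- tn / (tn + fp)
  f1          <- 2 * precision * recall / (precision + recall)
  c(Accuracy = accuracy, Precision = precision, Recall = recall,
    Specificity = specificity, F1_score = f1)
}
```

Applied to the matrix above (TN = 248, FN = 28, FP = 21, TP = 21), these formulas reproduce the printed values, e.g. an accuracy of 269/318 = 0.8459119.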
df_metrics <- data.frame(Model_name = character(),
                         Accuracy = numeric(),
                         Precision = numeric(),
                         Recall = numeric(),
                         F1_score = numeric(),
                         PR_AUC = numeric())

Model_name <- "Logistic Regression Unbalanced"


accuracy <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Accuracy")])
precision <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Precision")])
recall <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Recall")])
specificity <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "Specificity")])
f1_score <- as.numeric(metrics_lr_unb[2, which(metrics_lr_unb[1, ] == "F1 score")])

class0_indices = which(test.data$Attrition == 0)
pr_curve_glm_unb <- pr.curve(scores.class0 = predicted.classes.step[class0_indices], scores.class1 = predicted.classes.step[-class0_indices], curve=TRUE, max.compute = T, min.compute = T, rand.compute = T)
pr_auc_unb <- as.numeric(pr_curve_glm_unb$auc.integral)

new_row <- data.frame(Model_name = Model_name,
                      Accuracy = accuracy,
                      Precision = precision,
                      Recall = recall,
                      F1_score = f1_score,
                      PR_AUC = pr_auc_unb)

df_metrics <- rbind(df_metrics, new_row)

Logistic Regression with Balanced Dataset

Since we are working with an unbalanced dataset, it is advisable to employ sampling techniques to address the class imbalance. This allows the model to capture patterns from the minority class, which is crucial, as our main objective is to accurately detect the employees who intend to leave. We balance only the training set, using the ovun.sample function and the SMOTE method, while the test set keeps its original class ratio. Furthermore, we keep excluding the collinear variables identified through the VIF values of the full model.

Keep in mind that we examine every sampling technique in detail only for logistic regression. Having concluded that oversampling produces the best-performing models, we will restrict ourselves to the unbalanced and oversampled training sets for the remaining models.

Sampling with ovun.sample

We initially opted for undersampling the data with the ovun.sample function, but eventually discarded this approach because it produced a training set even smaller than the test set. In the end, we resampled the training set while maintaining its original size, drawing each class with probability p = 0.5 so that the two classes are approximately balanced.

under.train.data <- ovun.sample(Attrition ~ ., data = train.data, N = nrow(train.data), p = 0.5, seed=108)$data

The training data is now split in the following way:

table(under.train.data$Attrition)
## 
##   0   1 
## 480 475

Here follows the application of the Logistic Regression model on the new training set.

full.model.under <- glm(under.train.data$Attrition ~ . - JobLevel - YearsAtCompany, data = under.train.data, family = binomial)
summary(full.model.under)
## 
## Call:
## glm(formula = under.train.data$Attrition ~ . - JobLevel - YearsAtCompany, 
##     family = binomial, data = under.train.data)
## 
## Coefficients:
##                             Estimate  Std. Error z value             Pr(>|z|)
## (Intercept)               6.46010374  1.24111229   5.205        0.00000019390
## Age                      -0.01587964  0.01282487  -1.238             0.215645
## BusinessTravel            1.07513350  0.18140156   5.927        0.00000000309
## Department               -0.42766584  0.18280250  -2.339             0.019310
## DistanceFromHome          0.02708938  0.01105806   2.450             0.014296
## Education                -0.20376032  0.09285701  -2.194             0.028211
## EducationField            0.17634204  0.06125771   2.879             0.003993
## EnvironmentSatisfaction  -0.31050304  0.07995386  -3.884             0.000103
## Gender                   -0.33431166  0.18882197  -1.771             0.076642
## JobInvolvement           -0.49621082  0.11944054  -4.154        0.00003260582
## JobRole                   0.03627219  0.04572943   0.793             0.427666
## JobSatisfaction          -0.43443229  0.08065933  -5.386        0.00000007204
## MaritalStatus            -0.66325200  0.16759086  -3.958        0.00007571717
## MonthlyIncome            -0.00008609  0.00003380  -2.547             0.010860
## NumCompaniesWorked        0.12644992  0.03754910   3.368             0.000758
## OverTime                  2.06213804  0.19421566  10.618 < 0.0000000000000002
## PercentSalaryHike        -0.12916719  0.04077693  -3.168             0.001537
## PerformanceRating         0.51731295  0.41191643   1.256             0.209164
## RelationshipSatisfaction -0.16175836  0.08440645  -1.916             0.055311
## StockOptionLevel         -0.02886215  0.14288159  -0.202             0.839916
## TotalWorkingYears        -0.03596412  0.02321043  -1.549             0.121266
## TrainingTimesLastYear    -0.25877777  0.08008123  -3.231             0.001232
## WorkLifeBalance          -0.51773725  0.12215665  -4.238        0.00002252126
## YearsInCurrentRole       -0.16933580  0.04634446  -3.654             0.000258
## YearsSinceLastPromotion   0.19339453  0.03945852   4.901        0.00000095248
## YearsWithCurrManager     -0.05342510  0.04487224  -1.191             0.233809
##                             
## (Intercept)              ***
## Age                         
## BusinessTravel           ***
## Department               *  
## DistanceFromHome         *  
## Education                *  
## EducationField           ** 
## EnvironmentSatisfaction  ***
## Gender                   .  
## JobInvolvement           ***
## JobRole                     
## JobSatisfaction          ***
## MaritalStatus            ***
## MonthlyIncome            *  
## NumCompaniesWorked       ***
## OverTime                 ***
## PercentSalaryHike        ** 
## PerformanceRating           
## RelationshipSatisfaction .  
## StockOptionLevel            
## TotalWorkingYears           
## TrainingTimesLastYear    ** 
## WorkLifeBalance          ***
## YearsInCurrentRole       ***
## YearsSinceLastPromotion  ***
## YearsWithCurrManager        
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1323.88  on 954  degrees of freedom
## Residual deviance:  824.66  on 929  degrees of freedom
## AIC: 876.66
## 
## Number of Fisher Scoring iterations: 5

Notice that this model has a higher AIC than the previous ones. The features significant at the 0.001 level ('***') are the following:

  • BusinessTravel

  • EnvironmentSatisfaction

  • JobInvolvement

  • JobSatisfaction

  • MaritalStatus

  • NumCompaniesWorked

  • OverTime

  • WorkLifeBalance

  • YearsInCurrentRole

  • YearsSinceLastPromotion

##                        
## predicted.classes.under   0   1
##                       0 177   9
##                       1  92  40
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.6823899" "0.3030303" "0.8163265" "0.6579926"   "0.4419890"

Evidently, the chosen threshold of 0.4 provides a reasonable compromise between recall and F1 score. According to this model, the two most significant features are OverTime and BusinessTravel, the predictors with the smallest p-values.

As before, we apply feature selection and, as expected, the backward approach is again preferred.

summary(model.step.under)
## 
## Call:
## glm(formula = under.train.data$Attrition ~ BusinessTravel + Department + 
##     DistanceFromHome + Education + EducationField + EnvironmentSatisfaction + 
##     Gender + JobInvolvement + JobSatisfaction + MaritalStatus + 
##     MonthlyIncome + NumCompaniesWorked + OverTime + PercentSalaryHike + 
##     RelationshipSatisfaction + TotalWorkingYears + TrainingTimesLastYear + 
##     WorkLifeBalance + YearsInCurrentRole + YearsSinceLastPromotion, 
##     family = binomial, data = under.train.data)
## 
## Coefficients:
##                             Estimate  Std. Error z value             Pr(>|z|)
## (Intercept)               7.25033727  0.87490609   8.287 < 0.0000000000000002
## BusinessTravel            1.06623630  0.17705730   6.022        0.00000000172
## Department               -0.36490435  0.16731124  -2.181             0.029184
## DistanceFromHome          0.02403790  0.01093905   2.197             0.027989
## Education                -0.22338595  0.09174487  -2.435             0.014898
## EducationField            0.18498218  0.05955102   3.106             0.001895
## EnvironmentSatisfaction  -0.31382789  0.07929137  -3.958        0.00007560935
## Gender                   -0.30011663  0.18576233  -1.616             0.106182
## JobInvolvement           -0.50971128  0.11870425  -4.294        0.00001755143
## JobSatisfaction          -0.43759959  0.07966957  -5.493        0.00000003959
## MaritalStatus            -0.67487960  0.12787788  -5.278        0.00000013094
## MonthlyIncome            -0.00008727  0.00003300  -2.645             0.008172
## NumCompaniesWorked        0.11790024  0.03692251   3.193             0.001407
## OverTime                  2.07575450  0.19268544  10.773 < 0.0000000000000002
## PercentSalaryHike        -0.09162363  0.02492224  -3.676             0.000237
## RelationshipSatisfaction -0.16989380  0.08220426  -2.067             0.038760
## TotalWorkingYears        -0.05127510  0.02040634  -2.513             0.011981
## TrainingTimesLastYear    -0.27444513  0.07847603  -3.497             0.000470
## WorkLifeBalance          -0.50123394  0.12016514  -4.171        0.00003029876
## YearsInCurrentRole       -0.19962263  0.03687695  -5.413        0.00000006191
## YearsSinceLastPromotion   0.18319257  0.03894127   4.704        0.00000254701
##                             
## (Intercept)              ***
## BusinessTravel           ***
## Department               *  
## DistanceFromHome         *  
## Education                *  
## EducationField           ** 
## EnvironmentSatisfaction  ***
## Gender                      
## JobInvolvement           ***
## JobSatisfaction          ***
## MaritalStatus            ***
## MonthlyIncome            ** 
## NumCompaniesWorked       ** 
## OverTime                 ***
## PercentSalaryHike        ***
## RelationshipSatisfaction *  
## TotalWorkingYears        *  
## TrainingTimesLastYear    ***
## WorkLifeBalance          ***
## YearsInCurrentRole       ***
## YearsSinceLastPromotion  ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1323.9  on 954  degrees of freedom
## Residual deviance:  829.9  on 934  degrees of freedom
## AIC: 871.9
## 
## Number of Fisher Scoring iterations: 5

We can see that the function step has removed the following features from the model model.step.under:

  • Age

  • JobRole

  • PerformanceRating

  • StockOptionLevel

  • YearsWithCurrManager

It yields the following results using a threshold of 0.4:

print(confusion.matrix.step.under)
##                             
## predicted.classes.step.under   0   1
##                            0 174   9
##                            1  95  40
compute_metrics(confusion.matrix.step.under)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.6729560" "0.2962963" "0.8163265" "0.6468401"   "0.4347826"
class0_indices = which(test.data$Attrition == 0)
pr_curve_glm <- pr.curve(scores.class0 = predicted.step.under[class0_indices], scores.class1 = predicted.step.under[-class0_indices], curve=TRUE, max.compute = T, min.compute = T, rand.compute = T)
plot(pr_curve_glm, max.plot = FALSE, min.plot = FALSE, rand.plot = FALSE, fill.area = FALSE, color = 2, auc.main = FALSE, ylim = c(0, 1))
lr_auc<- as.numeric(pr_curve_glm$auc.integral)

cat("Area Under the Curve (AUC):", lr_auc, "\n")
## Area Under the Curve (AUC): 0.7236698
text(0.6, 0.4, paste("AUC =", round(lr_auc, 4)), col = "black", cex = 1.2)

Oversampling

We then used the ovun.sample function to oversample the training set. With method = "over", it resamples (duplicates with replacement) instances of the minority class until the class distribution is balanced.

over.train.data <- ovun.sample(Attrition ~ ., 
                               data = train.data, p = 0.5, method = "over", seed=108)$data
table(over.train.data$Attrition)
## 
##   0   1 
## 790 790
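Conceptually, what method = "over" does can be sketched in base R (illustrative only; ovun.sample handles this internally, and train.data comes from the earlier chunks):

```r
minority <- train.data[train.data$Attrition == 1, ]
majority <- train.data[train.data$Attrition == 0, ]

# draw additional minority rows with replacement until the classes match
extra <- minority[sample(nrow(minority),
                         nrow(majority) - nrow(minority),
                         replace = TRUE), ]
balanced <- rbind(majority, minority, extra)
table(balanced$Attrition)   # both classes now have the same number of rows
```

This matches the 790/790 split shown above, since the majority class of the training set has 790 instances.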

The Logistic Regression model trained on the oversampled training set is the following:

summary(full.model.over)
## 
## Call:
## glm(formula = over.train.data$Attrition ~ . - JobLevel - YearsAtCompany, 
##     family = binomial, data = over.train.data)
## 
## Coefficients:
##                             Estimate  Std. Error z value             Pr(>|z|)
## (Intercept)               6.45749422  0.91876344   7.028    0.000000000002088
## Age                      -0.02286485  0.00990843  -2.308             0.021021
## BusinessTravel            0.96740757  0.13393824   7.223    0.000000000000509
## Department               -0.56014447  0.14099949  -3.973    0.000071071439327
## DistanceFromHome          0.03328940  0.00851547   3.909    0.000092570434441
## Education                -0.17558726  0.06751774  -2.601             0.009306
## EducationField            0.15059333  0.04622681   3.258             0.001123
## EnvironmentSatisfaction  -0.36734732  0.06093993  -6.028    0.000000001659772
## Gender                   -0.49374229  0.14279706  -3.458             0.000545
## JobInvolvement           -0.54683399  0.09407276  -5.813    0.000000006140584
## JobRole                   0.08545792  0.03457613   2.472             0.013451
## JobSatisfaction          -0.34581483  0.06185903  -5.590    0.000000022658734
## MaritalStatus            -0.66847192  0.12216833  -5.472    0.000000044566777
## MonthlyIncome            -0.00011858  0.00002562  -4.628    0.000003687267979
## NumCompaniesWorked        0.14405261  0.02958525   4.869    0.000001121252301
## OverTime                  1.96464365  0.14403286  13.640 < 0.0000000000000002
## PercentSalaryHike        -0.06763351  0.02988017  -2.263             0.023605
## PerformanceRating         0.25723230  0.30833348   0.834             0.404131
## RelationshipSatisfaction -0.13572620  0.06298315  -2.155             0.031165
## StockOptionLevel         -0.13767017  0.10450264  -1.317             0.187710
## TotalWorkingYears        -0.01980694  0.01794954  -1.103             0.269819
## TrainingTimesLastYear    -0.20833565  0.05788950  -3.599             0.000320
## WorkLifeBalance          -0.53857485  0.09065867  -5.941    0.000000002838292
## YearsInCurrentRole       -0.08302401  0.03376607  -2.459             0.013940
## YearsSinceLastPromotion   0.20436861  0.02931361   6.972    0.000000000003129
## YearsWithCurrManager     -0.10446339  0.03281303  -3.184             0.001455
##                             
## (Intercept)              ***
## Age                      *  
## BusinessTravel           ***
## Department               ***
## DistanceFromHome         ***
## Education                ** 
## EducationField           ** 
## EnvironmentSatisfaction  ***
## Gender                   ***
## JobInvolvement           ***
## JobRole                  *  
## JobSatisfaction          ***
## MaritalStatus            ***
## MonthlyIncome            ***
## NumCompaniesWorked       ***
## OverTime                 ***
## PercentSalaryHike        *  
## PerformanceRating           
## RelationshipSatisfaction *  
## StockOptionLevel            
## TotalWorkingYears           
## TrainingTimesLastYear    ***
## WorkLifeBalance          ***
## YearsInCurrentRole       *  
## YearsSinceLastPromotion  ***
## YearsWithCurrManager     ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2190.3  on 1579  degrees of freedom
## Residual deviance: 1420.9  on 1554  degrees of freedom
## AIC: 1472.9
## 
## Number of Fisher Scoring iterations: 5

We can notice that the only variables with a p-value larger than 0.1 are PerformanceRating, StockOptionLevel and TotalWorkingYears.

These are the results:

##                       
## predicted.classes.over   0   1
##                      0 190  12
##                      1  79  37
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7138365" "0.3189655" "0.7551020" "0.7063197"   "0.4484848"

The threshold that maximizes recall is 0.4, which results in only 12 leaving employees being misclassified, with minimal impact on the F1 score. We will proceed with the backward selection using this threshold.

summary(model.step.over)
## 
## Call:
## glm(formula = over.train.data$Attrition ~ Age + BusinessTravel + 
##     Department + DistanceFromHome + Education + EducationField + 
##     EnvironmentSatisfaction + Gender + JobInvolvement + JobRole + 
##     JobSatisfaction + MaritalStatus + MonthlyIncome + NumCompaniesWorked + 
##     OverTime + PercentSalaryHike + RelationshipSatisfaction + 
##     TrainingTimesLastYear + WorkLifeBalance + YearsInCurrentRole + 
##     YearsSinceLastPromotion + YearsWithCurrManager, family = binomial, 
##     data = over.train.data)
## 
## Coefficients:
##                             Estimate  Std. Error z value             Pr(>|z|)
## (Intercept)               7.13574258  0.66537605  10.724 < 0.0000000000000002
## Age                      -0.02808470  0.00846169  -3.319             0.000903
## BusinessTravel            0.96695752  0.13348068   7.244 0.000000000000435076
## Department               -0.56800665  0.13981499  -4.063 0.000048537675384433
## DistanceFromHome          0.03240924  0.00845645   3.832             0.000127
## Education                -0.18294608  0.06732461  -2.717             0.006580
## EducationField            0.15298493  0.04612509   3.317             0.000911
## EnvironmentSatisfaction  -0.36109862  0.06049928  -5.969 0.000000002392350087
## Gender                   -0.49626879  0.14166218  -3.503             0.000460
## JobInvolvement           -0.54956292  0.09363125  -5.869 0.000000004372716457
## JobRole                   0.08467057  0.03461326   2.446             0.014438
## JobSatisfaction          -0.34676045  0.06168309  -5.622 0.000000018914706532
## MaritalStatus            -0.77277184  0.09437820  -8.188 0.000000000000000266
## MonthlyIncome            -0.00013414  0.00002134  -6.285 0.000000000327800097
## NumCompaniesWorked        0.13457742  0.02907479   4.629 0.000003680336249982
## OverTime                  1.95492858  0.14342617  13.630 < 0.0000000000000002
## PercentSalaryHike        -0.04718469  0.01869251  -2.524             0.011594
## RelationshipSatisfaction -0.12241701  0.06241537  -1.961             0.049841
## TrainingTimesLastYear    -0.21132609  0.05750318  -3.675             0.000238
## WorkLifeBalance          -0.55085496  0.08948812  -6.156 0.000000000747839158
## YearsInCurrentRole       -0.08599768  0.03332432  -2.581             0.009862
## YearsSinceLastPromotion   0.19984752  0.02877159   6.946 0.000000000003757839
## YearsWithCurrManager     -0.11173822  0.03242519  -3.446             0.000569
##                             
## (Intercept)              ***
## Age                      ***
## BusinessTravel           ***
## Department               ***
## DistanceFromHome         ***
## Education                ** 
## EducationField           ***
## EnvironmentSatisfaction  ***
## Gender                   ***
## JobInvolvement           ***
## JobRole                  *  
## JobSatisfaction          ***
## MaritalStatus            ***
## MonthlyIncome            ***
## NumCompaniesWorked       ***
## OverTime                 ***
## PercentSalaryHike        *  
## RelationshipSatisfaction *  
## TrainingTimesLastYear    ***
## WorkLifeBalance          ***
## YearsInCurrentRole       ** 
## YearsSinceLastPromotion  ***
## YearsWithCurrManager     ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 2190.3  on 1579  degrees of freedom
## Residual deviance: 1424.5  on 1557  degrees of freedom
## AIC: 1470.5
## 
## Number of Fisher Scoring iterations: 5

The backward-selection algorithm removed exactly the columns PerformanceRating, StockOptionLevel and TotalWorkingYears mentioned above, because of their high p-values.

##                            
## predicted.classes.step.over   0   1
##                           0 187  12
##                           1  82  37
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7044025" "0.3109244" "0.7551020" "0.6951673"   "0.4404762"

SMOTE

SMOTE (Synthetic Minority Over-sampling Technique) is a data augmentation technique that generates synthetic examples for the minority class instead of simply duplicating existing ones.
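
Concretely, each synthetic example is built by interpolating between a minority-class observation \(x_i\) and one of its \(k\) nearest minority-class neighbors \(x_{nn}\):

\(x_{new} = x_i + u \cdot (x_{nn} - x_i), \qquad u \sim \mathcal{U}(0,1)\)

The new points therefore lie on the segments connecting existing minority examples, rather than being exact copies of them.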

train.data$Attrition = as.factor(train.data$Attrition)
smote.train.data <- smote(Attrition ~ ., train.data, perc.over = 2, k = 5)
table(smote.train.data$Attrition)
## 
##   0   1 
## 660 495

The model trained on the SMOTEd training set is:

full.model.smote <- glm(Attrition ~ . - JobLevel - YearsAtCompany, data = smote.train.data, family = binomial)
summary(full.model.smote)
## 
## Call:
## glm(formula = Attrition ~ . - JobLevel - YearsAtCompany, family = binomial, 
##     data = smote.train.data)
## 
## Coefficients:
##                             Estimate  Std. Error z value             Pr(>|z|)
## (Intercept)               7.98817675  1.17017194   6.826      0.0000000000087
## Age                      -0.02380278  0.01318624  -1.805              0.07106
## BusinessTravel            0.99400688  0.16883374   5.887      0.0000000039211
## Department               -0.55772607  0.18027384  -3.094              0.00198
## DistanceFromHome          0.01666558  0.01039088   1.604              0.10874
## Education                -0.13102640  0.08596606  -1.524              0.12747
## EducationField            0.23907772  0.06126245   3.903      0.0000951977891
## EnvironmentSatisfaction  -0.41590030  0.07768143  -5.354      0.0000000860680
## Gender                   -0.72313274  0.17584468  -4.112      0.0000391672573
## JobInvolvement           -0.71816512  0.12200275  -5.886      0.0000000039454
## JobRole                   0.00273708  0.04392030   0.062              0.95031
## JobSatisfaction          -0.38756096  0.07911320  -4.899      0.0000009641631
## MaritalStatus            -0.63382441  0.16137871  -3.928      0.0000858123398
## MonthlyIncome            -0.00007128  0.00003290  -2.167              0.03025
## NumCompaniesWorked        0.15884842  0.03872909   4.102      0.0000410433016
## OverTime                  2.14256801  0.18107257  11.833 < 0.0000000000000002
## PercentSalaryHike        -0.06611594  0.03849501  -1.718              0.08588
## PerformanceRating        -0.00834256  0.38744912  -0.022              0.98282
## RelationshipSatisfaction -0.19309669  0.07925203  -2.436              0.01483
## StockOptionLevel         -0.16553031  0.14071263  -1.176              0.23945
## TotalWorkingYears        -0.06156952  0.02390990  -2.575              0.01002
## TrainingTimesLastYear    -0.04272155  0.07287277  -0.586              0.55771
## WorkLifeBalance          -0.72136300  0.11619370  -6.208      0.0000000005357
## YearsInCurrentRole       -0.06124557  0.04689500  -1.306              0.19155
## YearsSinceLastPromotion   0.18391209  0.03766118   4.883      0.0000010430715
## YearsWithCurrManager     -0.11015197  0.04666833  -2.360              0.01826
##                             
## (Intercept)              ***
## Age                      .  
## BusinessTravel           ***
## Department               ** 
## DistanceFromHome            
## Education                   
## EducationField           ***
## EnvironmentSatisfaction  ***
## Gender                   ***
## JobInvolvement           ***
## JobRole                     
## JobSatisfaction          ***
## MaritalStatus            ***
## MonthlyIncome            *  
## NumCompaniesWorked       ***
## OverTime                 ***
## PercentSalaryHike        .  
## PerformanceRating           
## RelationshipSatisfaction *  
## StockOptionLevel            
## TotalWorkingYears        *  
## TrainingTimesLastYear       
## WorkLifeBalance          ***
## YearsInCurrentRole          
## YearsSinceLastPromotion  ***
## YearsWithCurrManager     *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1577.52  on 1154  degrees of freedom
## Residual deviance:  978.29  on 1129  degrees of freedom
## AIC: 1030.3
## 
## Number of Fisher Scoring iterations: 5

As is evident from the summary, many features are significant; using a threshold of 0.4, the model yields the following result on the test set:

predicted.smote <- predict(full.model.smote, newdata = test.data, type = "response")
predicted.classes.smote <- ifelse(predicted.smote > 0.4, 1, 0)
confusion_matrix.s <- table(predicted.classes.smote, test.data$Attrition)
print(confusion_matrix.s)
##                        
## predicted.classes.smote   0   1
##                       0 201  12
##                       1  68  37
compute_metrics(confusion_matrix.s)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7484277" "0.3523810" "0.7551020" "0.7472119"   "0.4805195"

The model with threshold 0.4 reaches a 75% recall with an F1-score of 0.48. The backward algorithm reduces the AIC from 1030 to 1023 by removing the columns StockOptionLevel, PerformanceRating, TotalWorkingYears, TrainingTimesLastYear and YearsInCurrentRole. However, since the performance slightly decreases, we do not report that model here.

Shrinkage Methods

Instead of managing model complexity through subset-selection methods, we can fit a model with all the predictors using techniques that constrain the coefficient estimates. In our case, we apply Ridge and Lasso regression: Ridge shrinks the coefficients with a quadratic (L2) penalty, while Lasso uses an absolute-value (L1) penalty.

Ridge Regularization

Ridge regularization, also known as L2 regularization, adds a penalty term to the loss function, which corresponds to the sum of the squared values of the model’s coefficients. This penalty term is essentially the L2 norm of the coefficients and is represented as:

\(\lambda\sum_{j=1}^{p}\beta_j^2\)

This regularization method encourages the model to keep the coefficients of the variables small, pushing them towards zero.
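
For our logistic model this means that, up to glmnet's internal scaling of the penalty, the fit minimizes the binomial deviance plus the L2 term:

\(\min_{\beta_0,\,\beta}\left\{-\frac{1}{n}\sum_{i=1}^{n}\ell(y_i,\;\beta_0 + x_i^\top\beta) + \lambda\sum_{j=1}^{p}\beta_j^2\right\}\)

where \(\ell\) denotes the log-likelihood contribution of observation \(i\).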

The shrinkage term is controlled by the hyperparameter \(\lambda\), which we will select using cross-validation. We will consider a range of possible values for \(\lambda\) from 100,000 to 0.001.

grid <- 10^seq(5, -3, length=100)

Ridge Regularization with Unbalanced Dataset

We define the model ridge.mod specifying the grid of lambda values for cross-validation.

X <- model.matrix(Attrition ~ ., data = HR_Analytics)
X <- X[,-1]
y <- HR_Analytics$Attrition
y.test <- y[test]
ridge.mod <- cv.glmnet(X[train,], y[train], alpha = 0,
                    lambda = grid, thresh = 1e-12, standardize = TRUE, nfold = 10)
bestlam <- ridge.mod$lambda.min

These are the results of unbalanced Ridge Regression using the optimal lambda value:

bestlam
## [1] 0.04132012
ridge.pred <- predict(ridge.mod, s = bestlam, newx = X[test, ])
ridge.pred_class <- ifelse(ridge.pred > 0.3, 1, 0)
ridge_confusion_mat <- table(ridge.pred_class, y[test])
print(ridge_confusion_mat)
##                 
## ridge.pred_class   0   1
##                0 224  18
##                1  45  31
ridge.metrics <- compute_metrics(ridge_confusion_mat)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8018868" "0.4078947" "0.6326531" "0.8327138"   "0.4960000"

From the plot and the coefficients printed below, it is evident that there is a particularly significant feature: OverTime. Other important variables are BusinessTravel, MaritalStatus, WorkLifeBalance and EnvironmentSatisfaction.

ridge.mod_un <- glmnet(X, y, alpha = 0, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(ridge.mod_un, xvar="lambda", label=TRUE)

selected_coeffs <- coefficients(ridge.mod_un, s = bestlam)
selected_coeffs
## 28 x 1 sparse Matrix of class "dgCMatrix"
##                                       s1
## (Intercept)               0.815718444792
## Age                      -0.003105269867
## BusinessTravel            0.070970813263
## Department               -0.064478371300
## DistanceFromHome          0.003352160742
## Education                -0.003289655360
## EducationField            0.014208466222
## EnvironmentSatisfaction  -0.041613003190
## Gender                   -0.044662639998
## JobInvolvement           -0.061572386841
## JobLevel                 -0.022114823279
## JobRole                   0.010732371093
## JobSatisfaction          -0.030713194735
## MaritalStatus            -0.053483541964
## MonthlyIncome            -0.000003928772
## NumCompaniesWorked        0.012627405968
## OverTime                  0.187414642316
## PercentSalaryHike        -0.003307936985
## PerformanceRating         0.020944815172
## RelationshipSatisfaction -0.014681473209
## StockOptionLevel         -0.021887701815
## TotalWorkingYears        -0.003033177018
## TrainingTimesLastYear    -0.010661248657
## WorkLifeBalance          -0.032025598637
## YearsAtCompany            0.001492645195
## YearsInCurrentRole       -0.006724181442
## YearsSinceLastPromotion   0.012412286374
## YearsWithCurrManager     -0.008176316332

Ridge Regularization with Oversampling

X1 <- model.matrix(Attrition ~ ., data = over.train.data)
X1 <- X1[,-1]
y <- as.numeric(over.train.data$Attrition)
X2 <- model.matrix(Attrition ~ ., data = test.data)
X2 <- X2[,-1]
y.test <- test.data$Attrition
ridge.mod <- cv.glmnet(X1, y, alpha = 0, lambda = grid, thresh = 1e-12,
                       standardize = TRUE, nfold = 10)
plot(ridge.mod)

## [1] 0.01353048
##                      
## ridge.pred_class.over   0   1
##                     0 182  10
##                     1  87  39
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.6949686" "0.3095238" "0.7959184" "0.6765799"   "0.4457143"

By looking at the plot below, it is evident that two features strongly dominate the others (which may explain the substantial improvement in recall over the unbalanced Ridge model): 15 and 2, being respectively OverTime and BusinessTravel, followed by features like MaritalStatus, Gender, WorkLifeBalance and Department.

ridge.mod.ov <- glmnet(X1, y, alpha = 0, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(ridge.mod.ov, xvar="lambda", label=TRUE)

selected_coeffs <- coefficients(ridge.mod.ov, s = bestlam)
selected_coeffs
## 28 x 1 sparse Matrix of class "dgCMatrix"
##                                       s1
## (Intercept)               1.504527005699
## Age                      -0.003176634840
## BusinessTravel            0.130021300151
## Department               -0.086116496383
## DistanceFromHome          0.004306641853
## Education                -0.021532507266
## EducationField            0.025528005502
## EnvironmentSatisfaction  -0.049952831106
## Gender                   -0.062594796299
## JobInvolvement           -0.077458789308
## JobLevel                 -0.052918108298
## JobRole                   0.009872560862
## JobSatisfaction          -0.052639285692
## MaritalStatus            -0.082034709056
## MonthlyIncome            -0.000006020989
## NumCompaniesWorked        0.022832081042
## OverTime                  0.312540810420
## PercentSalaryHike        -0.008242457602
## PerformanceRating         0.015645747225
## RelationshipSatisfaction -0.020579141966
## StockOptionLevel         -0.030097795317
## TotalWorkingYears        -0.005698733716
## TrainingTimesLastYear    -0.029367835454
## WorkLifeBalance          -0.079806078765
## YearsAtCompany            0.014265502589
## YearsInCurrentRole       -0.018531128044
## YearsSinceLastPromotion   0.016854007204
## YearsWithCurrManager     -0.019045082590
## Area Under the Curve (AUC): 0.7199852

Lasso Regularization

Lasso regularization, also known as L1 regularization, adds a penalty term that consists of the sum of the absolute values of the model’s coefficients:

\(\lambda\sum_{j=1}^{p}|\beta_j|\)

This encourages the model’s coefficients to become smaller, and some of them may become exactly zero. In fact, Lasso is a sparsity-inducing approach where the nonzero coefficients left are associated with significant variables.
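
With the same notation as in the Ridge section, the Lasso fit solves

\(\min_{\beta_0,\,\beta}\left\{-\frac{1}{n}\sum_{i=1}^{n}\ell(y_i,\;\beta_0 + x_i^\top\beta) + \lambda\sum_{j=1}^{p}|\beta_j|\right\}\)

and it is the non-differentiability of the absolute value at zero that allows coefficients to be shrunk exactly to zero.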

Similar to Ridge regularization, the penalty term is controlled by a hyperparameter \(\lambda\), which is selected through cross-validation. The search for the optimal \(\lambda\) explores a grid of possible values ranging from 1000 to 0.0001.

grid <- 10^seq(3, -4, length=100)

Lasso Regularization with Unbalanced Dataset

Let’s define the model lasso.mod, providing the grid of values over which the function cv.glmnet can perform cross-validation. This allows us to retrieve the \(\lambda\) that yields the best result.

X <- model.matrix(Attrition ~ . , data = HR_Analytics)
X <- X[,-1]
y <- HR_Analytics$Attrition
y.test <- y[test]
lasso.mod <- cv.glmnet(X[train, ], y[train], alpha = 1,
                    lambda = grid, thresh = 1e-12, standardize = TRUE, nfold = 10)
bestlam <- lasso.mod$lambda.min
plot(lasso.mod)

We can now let our model predict using the optimal \(\lambda\) obtained and a threshold of 0.3.

lasso.predict <- predict(lasso.mod, s = bestlam, newx = X[test, ])
lasso.predict_class <- ifelse(lasso.predict > 0.3, 1, 0)
lasso_confusion_mat <- table(lasso.predict_class, y[test])
lasso_confusion_mat
##                    
## lasso.predict_class   0   1
##                   0 220  18
##                   1  49  31
lasso.metrics <- compute_metrics(lasso_confusion_mat)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7893082" "0.3875000" "0.6326531" "0.8178439"   "0.4806202"

The plot gives a clear representation of how the variable OverTime, in this model as well, proves to be the most meaningful. Other noteworthy predictors are:

  • BusinessTravel
  • Department
  • JobInvolvement

lasso.mod.full <- glmnet(X, y, alpha = 1, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(lasso.mod.full, xvar = "lambda", label = TRUE)

selected_coeffs <- coefficients(lasso.mod, s = bestlam)
selected_coeffs
## 28 x 1 sparse Matrix of class "dgCMatrix"
##                                    s1
## (Intercept)               0.966102060
## Age                      -0.003332553
## BusinessTravel            0.086191823
## Department               -0.081402434
## DistanceFromHome          0.002877408
## Education                -0.011615219
## EducationField            0.015104275
## EnvironmentSatisfaction  -0.041219869
## Gender                   -0.054372346
## JobInvolvement           -0.061638026
## JobLevel                 -0.045122365
## JobRole                   0.012206188
## JobSatisfaction          -0.028840010
## MaritalStatus            -0.051117877
## MonthlyIncome             .          
## NumCompaniesWorked        0.015980572
## OverTime                  0.235416118
## PercentSalaryHike        -0.004795753
## PerformanceRating         0.006475347
## RelationshipSatisfaction -0.013226189
## StockOptionLevel         -0.024477219
## TotalWorkingYears        -0.003471853
## TrainingTimesLastYear    -0.010878074
## WorkLifeBalance          -0.052855235
## YearsAtCompany            0.005366176
## YearsInCurrentRole       -0.009372052
## YearsSinceLastPromotion   0.011223508
## YearsWithCurrManager     -0.009732903

Lasso Regularization with Oversampling

Let’s see how Lasso Regularization performs using the oversampled training set.

As before, thanks to cross-validation, we can retrieve the best lambda:

X1 <- model.matrix(Attrition ~ . - JobLevel - YearsAtCompany, data = over.train.data)
X1 <- X1[,-1]
y <- over.train.data$Attrition
y <- as.numeric(over.train.data$Attrition)
X2 <-  model.matrix(Attrition ~ . - JobLevel - YearsAtCompany, data = test.data)
X2 <- X2[,-1]
y.test <- test.data$Attrition
lasso.mod <- cv.glmnet(X1, y, alpha = 1, lambda = grid, standardize = TRUE, nfold = 10)
plot(lasso.mod, label=TRUE)

bestlam.lasso.over <- lasso.mod$lambda.min
## [1] 0.001149757
lasso.pred.over <- predict(lasso.mod, s=bestlam.lasso.over, newx=X2, type="response")
lasso.over <- ifelse(lasso.pred.over > 0.44, 1, 0)
lasso_confusion_mat.over <- table(lasso.over, test.data$Attrition)
lasso_confusion_mat.over
##           
## lasso.over   0   1
##          0 178   9
##          1  91  40
metrics.lasso <- compute_metrics(lasso_confusion_mat.over)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.6855346" "0.3053435" "0.8163265" "0.6617100"   "0.4444444"

As you can notice from the plot and the coefficients below, the variable OverTime once again contributes the most to the prediction.

lasso.mod.over.full <- glmnet(X1, y, alpha = 1, lambda = grid, thresh = 1e-12, standardize = TRUE)
plot(lasso.mod.over.full, xvar = "lambda", label = TRUE)

selected_coeffs <- coefficients(lasso.mod, s = bestlam.lasso.over)
selected_coeffs
## 26 x 1 sparse Matrix of class "dgCMatrix"
##                                      s1
## (Intercept)               1.51227971311
## Age                      -0.00376031306
## BusinessTravel            0.13945122689
## Department               -0.07992339425
## DistanceFromHome          0.00448770921
## Education                -0.02216796756
## EducationField            0.02610664899
## EnvironmentSatisfaction  -0.05476229299
## Gender                   -0.06616115435
## JobInvolvement           -0.08494680847
## JobRole                   0.01239492544
## JobSatisfaction          -0.05159400366
## MaritalStatus            -0.09181812427
## MonthlyIncome            -0.00001698944
## NumCompaniesWorked        0.02109520326
## OverTime                  0.33351819309
## PercentSalaryHike        -0.00879581719
## PerformanceRating         0.02056910279
## RelationshipSatisfaction -0.02048049377
## StockOptionLevel         -0.02564077293
## TotalWorkingYears        -0.00308224581
## TrainingTimesLastYear    -0.02968206648
## WorkLifeBalance          -0.08637608177
## YearsInCurrentRole       -0.01147142270
## YearsSinceLastPromotion   0.02704394225
## YearsWithCurrManager     -0.01500106023

Naive Bayes

The Naive Bayes model is a classification model based on Bayes’ theorem. It assumes strong conditional independence among the predictors, hence the ‘naive’ designation. As observed in the exploratory data analysis (EDA), this independence condition is not satisfied in our dataset, but we proceed with the model anyway.
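
Under the conditional-independence assumption, the posterior probability of attrition factorizes as

\(P(y \mid x_1, \dots, x_p) \;\propto\; P(y)\prod_{j=1}^{p}P(x_j \mid y)\)

and the predicted class is the value of \(y\) that maximizes this product.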

Naive Bayes with Unbalanced Data

Here follows the Naive Bayes model applied to the unbalanced training set, using a threshold of 0.4.

train.data$Attrition <- as.factor(train.data$Attrition)

naivebayes.model <- naive_bayes(Attrition ~ . - JobLevel - YearsAtCompany, data = train.data)
naivebayes.prediction <- predict(naivebayes.model, newdata = test.data, type = "prob")
naivebayes.y_pred_class <- ifelse(naivebayes.prediction[, "1"] > 0.4, 1, 0)
naivebayesmat <- table(naivebayes.y_pred_class, test.data$Attrition)
naivebayesmat
##                        
## naivebayes.y_pred_class   0   1
##                       0 191  16
##                       1  78  33
naivebayes.metrics <- compute_metrics(naivebayesmat)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7044025" "0.2972973" "0.6734694" "0.7100372"   "0.4125000"

Naive Bayes with Oversampling

Using the oversampled training set and maintaining the same threshold, the recall improves by roughly 10 percentage points (from 0.67 to 0.78). Of course, this tendency penalizes the overall F1-score, which drops considerably.

##                     
## naivebayes.pred_over   0   1
##                    0 132  11
##                    1 137  38
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.5345912" "0.2171429" "0.7755102" "0.4907063"   "0.3392857"

LDA

Linear Discriminant Analysis is a classification model that aims at detecting a subspace which maximizes the separation between the classes while minimizing the variation within each class, so that the data lie in a subspace where the classes are well separated.

LDA is based on two assumptions:

  • data within each class follows a normal distribution;
  • covariance matrices of different classes are equal.
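
Under these assumptions, an observation \(x\) is assigned to the class \(k\) with the largest linear discriminant score

\(\delta_k(x) = x^\top\Sigma^{-1}\mu_k - \frac{1}{2}\mu_k^\top\Sigma^{-1}\mu_k + \log\pi_k\)

where \(\mu_k\) is the class mean, \(\Sigma\) the common covariance matrix and \(\pi_k\) the prior probability of class \(k\).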

Let’s check whether our features are normally distributed by defining the following function, which lets us visually assess the normality of the dataset columns through density and QQ plots. It also runs the Shapiro-Wilk test, which assesses normality numerically.

check.for.normality <- function(i) {
  column <- HR_Analytics[, i]  # select the i-th column
  par(mfrow = c(1, 2))
  plot(density(column), main = paste("Density Plot of", colnames(HR_Analytics)[i]),
       xlab = colnames(HR_Analytics)[i])
  qqnorm(y = column, main = paste("QQ plot of", colnames(HR_Analytics)[i]))
  qqline(y = column)
  par(mfrow = c(1, 1))
  print(paste("Shapiro-Wilk Test", colnames(HR_Analytics)[i]))
  print(shapiro.test(column))
}

For brevity, we only show the analysis of two variables: except for Age, the Shapiro-Wilk test always results in a p-value smaller than 2.2e-16, and even for Age the p-value is far below any reasonable significance level. We conclude that our variables are not normally distributed, but we apply LDA regardless, expecting the model’s performance to suffer because the assumptions are not met.

## [1] "Shapiro-Wilk Test Age"
## 
##  Shapiro-Wilk normality test
## 
## data:  column
## W = 0.98063, p-value = 0.000000000004959

## [1] "Shapiro-Wilk Test Education"
## 
##  Shapiro-Wilk normality test
## 
## data:  column
## W = 0.89516, p-value < 0.00000000000000022

LDA with Unbalanced Dataset

Here follows the application of LDA to our unbalanced training set. We try out the thresholds 0.4, 0.5 and 0.6 to determine which one yields the best model performance.

lda.fit <- lda(Attrition ~ . - JobLevel - YearsAtCompany, data = train.data)
lda.pred <- predict(lda.fit, test.data, type="response")
lda.res <- lda.pred$posterior
for (t in c(0.4, 0.5, 0.6)) {
  lda.pred.best <- as.factor(ifelse(lda.res[, 2] > t, 1, 0))
  lda.conf.mat <- t(table(test.data$Attrition, lda.pred.best))
  print(lda.conf.mat)
  compute_metrics(lda.conf.mat)
  lda.error <- mean(test.data$Attrition != lda.pred.best)
  print(lda.error)
}
##              
## lda.pred.best   0   1
##             0 246  24
##             1  23  25
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8522013" "0.5208333" "0.5102041" "0.9144981"   "0.5154639"
## [1] 0.1477987
##              
## lda.pred.best   0   1
##             0 258  29
##             1  11  20
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8742138" "0.6451613" "0.4081633" "0.9591078"   "0.5000000"
## [1] 0.1257862
##              
## lda.pred.best   0   1
##             0 261  33
##             1   8  16
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.8710692" "0.6666667" "0.3265306" "0.9702602"   "0.4383562"
## [1] 0.1289308

Evidently, the threshold 0.4 induces a significantly higher recall than the other two, without penalizing the F1-score. The threshold 0.5 attains a slightly lower mean error, but since recall is the metric we care about most in this problem, we conclude that the model with threshold 0.4 has the best performance.

To keep it brief, we report below only a few coefficients, enough to show that the main direction - the one with the highest magnitude - is OverTime. It is the variable that best separates the data, even though it clearly does not do so sufficiently, as we expected given that our data are not linearly separable. This property can be seen through LDA itself: the densities of the two classes largely overlap.

## [1] "NumCompaniesWorked 0.0952166695814974"
## [1] "OverTime 1.43402644528933"
## [1] "PercentSalaryHike -0.0394597894704211"
## [1] "PerformanceRating 0.153788365467927"

LDA with Oversampling

The results of the LDA model applied to the oversampled training data show a drastic improvement in recall, which rises to about 0.80, at the cost of precision.

lda.fit <- lda(Attrition ~ .  - JobLevel - YearsAtCompany, data = over.train.data)
lda.pred_over <- predict(lda.fit, test.data, type="response")
lda.res_over <- lda.pred_over$posterior

lda.pred.best <- as.factor(ifelse(lda.res_over[,2] > 0.415, 1, 0))
lda.conf.mat <- t(table(test.data$Attrition, lda.pred.best))
print(lda.conf.mat)
##              
## lda.pred.best   0   1
##             0 182  10
##             1  87  39
metrics.lda=compute_metrics(lda.conf.mat)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.6949686" "0.3095238" "0.7959184" "0.6765799"   "0.4457143"
lda.error <-  mean(test.data$Attrition != lda.pred.best)
print(lda.error)
## [1] 0.3050314
plot(lda.fit)

QDA

Quadratic Discriminant Analysis is a classification model that exploits a quadratic decision boundary to separate the two classes of data. It is based on two main assumptions:

  • Each class is drawn from a normal distribution

  • Each class has a different covariance matrix.

The first assumption is shared with the LDA model: we have already shown that the dataset does not follow a normal distribution. However, as we did with LDA, we are going to evaluate this model anyway, aware of its limitations due to the conditions not being satisfied.
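
Since each class retains its own covariance matrix \(\Sigma_k\), the discriminant score is quadratic in \(x\):

\(\delta_k(x) = -\frac{1}{2}\log|\Sigma_k| - \frac{1}{2}(x - \mu_k)^\top\Sigma_k^{-1}(x - \mu_k) + \log\pi_k\)

which is what produces the quadratic decision boundary.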

QDA with Unbalanced Dataset

We start with the unbalanced dataset. It yields the same results regardless of the threshold used.

qda.fit <- qda(Attrition ~ .  - JobLevel - YearsAtCompany, data = train.data)
qda.pred <- predict(qda.fit, test.data, type="response")
qda.res <- qda.pred$posterior


qda.pred.best <- as.factor(ifelse(qda.res[,2] > 0.4, 1, 0))
qda.conf.mat <- t(table(test.data$Attrition, qda.pred.best))
print(qda.conf.mat)
##              
## qda.pred.best   0   1
##             0 232  30
##             1  37  19
compute_metrics(qda.conf.mat)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7893082" "0.3392857" "0.3877551" "0.8624535"   "0.3619048"
qda.error <-  mean(test.data$Attrition != qda.pred.best)
print(qda.error)
## [1] 0.2106918

We obtain a 21% mean error. Regardless of the threshold, a high specificity is achieved: the model rarely misclassifies the employees who don’t plan to leave the workplace. However, both recall and precision are low (0.39 and 0.34 respectively), so most of the employees who actually leave are missed.

QDA with Oversampling

qda.fit <- qda(Attrition ~ . - JobLevel - YearsAtCompany, data = over.train.data)
qda.pred_over <- predict(qda.fit, test.data, type="response")
qda.res_over <- qda.pred_over$posterior

qda.pred.best <- as.factor(ifelse(qda.res_over[,2] > 0.4, 1, 0))
qda.conf.mat <- table(test.data$Attrition, qda.pred.best)
print(t(qda.conf.mat))
##              
## qda.pred.best   0   1
##             0 197  29
##             1  72  20
metrics.qda<-compute_metrics(t(qda.conf.mat))
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.6823899" "0.2173913" "0.4081633" "0.7323420"   "0.2836879"
qda.error <-  mean(test.data$Attrition != qda.pred.best)
print(qda.error)
## [1] 0.3176101

KNN

KNN is an instance-based learning algorithm based on the assumption that examples with similar features belong to the same class. Given an unseen example, the model assigns it the label of the majority class among its K nearest neighbors. Since this involves computing distances between examples, it is necessary to scale the data so that features with a larger magnitude don’t dominate.
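
With standardized features, the neighbors are ranked by Euclidean distance, the metric used by the knn function from the class package:

\(d(x, x') = \sqrt{\sum_{j=1}^{p}(x_j - x'_j)^2}\)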

n <- nrow(HR_Analytics)
scaled.data <- as.data.frame(scale(HR_Analytics[-2]))
set.seed(108)
train <- sample(1:n, round(n * 0.75))
test <- setdiff(1:n, train)

train_data <- scaled.data[train, ]
train_data$Attrition <- HR_Analytics[train, 2]
test_data <- scaled.data[test, ]
test_data$Attrition <- HR_Analytics[test, 2]

over.train_data <- ovun.sample(Attrition ~ ., 
                               data = train_data, p = 0.5, method = "over", seed = 108)$data

best_f1_score <- 0  # Best F1-score found so far
best_k <- 0
consecutive_failures <- 0  # Counter for consecutive iterations without improvement

for (K in 3:50) {
  cat("\nK =", K)
  predicted.classes <- knn(over.train_data, test_data, over.train_data$Attrition, K)
  conf.matrix <- table(predicted.classes, test_data$Attrition)
  metrics <- compute_metrics(conf.matrix)
  current_f1_score <- as.numeric(metrics[2, which(metrics[1, ] == "F1 score")])
  
  if (current_f1_score > best_f1_score) {
    best_f1_score <- current_f1_score
    best_k <- K
    consecutive_failures <- 0  # Reset the counter
  } else {
    consecutive_failures <- consecutive_failures + 1
    if (consecutive_failures >= 5) {  # Stop early after 5 iterations without improvement
      break
    }
  }
}
## 
## K = 3       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7704403" "0.3421053" "0.5306122" "0.8141264"   "0.4160000"
## 
## K = 4       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7484277" "0.3218391" "0.5714286" "0.7806691"   "0.4117647"
## 
## K = 5       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7295597" "0.3052632" "0.5918367" "0.7546468"   "0.4027778"
## 
## K = 6       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7327044" "0.3085106" "0.5918367" "0.7583643"   "0.4055944"
## 
## K = 7       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7484277" "0.3333333" "0.6326531" "0.7695167"   "0.4366197"
## 
## K = 8       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7452830" "0.3260870" "0.6122449" "0.7695167"   "0.4255319"
## 
## K = 9       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7515723" "0.3255814" "0.5714286" "0.7843866"   "0.4148148"
## 
## K = 10       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7547170" "0.3294118" "0.5714286" "0.7881041"   "0.4179104"
## 
## K = 11       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7547170" "0.3333333" "0.5918367" "0.7843866"   "0.4264706"
## 
## K = 12       [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7421384" "0.3103448" "0.5510204" "0.7769517"   "0.3970588"
cat("\nBest K value:", best_k)
## 
## Best K value: 7
# Now, use the best K value to make predictions (again excluding the
# Attrition column from the features)
knn.pred <- knn(over.train_data[, names(over.train_data) != "Attrition"],
                test_data[, names(test_data) != "Attrition"],
                over.train_data$Attrition, best_k)
conf.matrix <- table(knn.pred, test_data$Attrition)
conf.matrix
##         
## knn.pred   0   1
##        0 206  19
##        1  63  30
metrics.knn <- compute_metrics(conf.matrix)
##        [,1]        [,2]        [,3]        [,4]          [,5]       
## metric "Accuracy"  "Precision" "Recall"    "Specificity" "F1 score" 
## values "0.7421384" "0.3225806" "0.6122449" "0.7657993"   "0.4225352"
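The `compute_metrics()` helper used throughout is defined earlier in the report; for reference, a hypothetical reimplementation consistent with the printed output above (rows of the confusion matrix are predicted classes, columns are true classes, with `1` = attrition as the positive class) could look like:

```r
# Sketch of a compute_metrics() helper: takes a 2x2 confusion matrix with
# dimnames "0"/"1" and returns the character matrix printed above.
compute_metrics <- function(conf.matrix) {
  TN <- conf.matrix["0", "0"]; FP <- conf.matrix["1", "0"]
  FN <- conf.matrix["0", "1"]; TP <- conf.matrix["1", "1"]
  accuracy    <- (TP + TN) / sum(conf.matrix)
  precision   <- TP / (TP + FP)
  recall      <- TP / (TP + FN)
  specificity <- TN / (TN + FP)
  f1          <- 2 * precision * recall / (precision + recall)
  rbind(metric = c("Accuracy", "Precision", "Recall", "Specificity", "F1 score"),
        values = as.character(c(accuracy, precision, recall, specificity, f1)))
}
```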

We report only a few iterations of KNN, but they clearly illustrate the model's behaviour: specificity and accuracy tend to improve as the hyperparameter K grows. Since our test data is highly imbalanced, larger values of K make it increasingly likely that the model assigns any unseen data point to the majority class.
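Note that tuning K on the test set, as above, lets test information leak into model selection. An alternative sketch (assuming the `class` package, whose `knn.cv` performs leave-one-out cross-validation) would tune K on the oversampled training data alone and touch the test set only once, for the final evaluation:

```r
library(class)

# Features and labels from the oversampled training set only
features <- over.train_data[, names(over.train_data) != "Attrition"]
labels   <- over.train_data$Attrition

# Leave-one-out cross-validated F1 score for each candidate K
cv_f1 <- sapply(3:50, function(K) {
  pred <- knn.cv(features, labels, k = K)
  tp <- sum(pred == 1 & labels == 1)
  fp <- sum(pred == 1 & labels == 0)
  fn <- sum(pred == 0 & labels == 1)
  2 * tp / (2 * tp + fp + fn)
})
best_k_cv <- (3:50)[which.max(cv_f1)]
```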

Comparison between the models

Model_name Accuracy Precision Recall F1_score
Lasso Regularization 0.6855346 0.3053435 0.8163265 0.4444444
Ridge Regularization 0.6949686 0.3095238 0.7959184 0.4457143
LDA 0.6949686 0.3095238 0.7959184 0.4457143
Naive Bayes 0.5345912 0.2171429 0.7755102 0.3392857
Logistic Regression 0.7138365 0.3189655 0.7551020 0.4484848
Logistic regression with Backward Selection 0.7044025 0.3109244 0.7551020 0.4404762
KNN 0.7421384 0.3225806 0.6122449 0.4225352
Logistic Regression Unbalanced 0.8459119 0.5000000 0.4285714 0.4615385
QDA 0.6823899 0.2173913 0.4081633 0.2836879
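The table above can also be assembled programmatically; a minimal sketch (only accuracy and recall are reassembled here for brevity, using the values already reported):

```r
# Headline metrics from the comparison table, sorted by recall,
# the metric driving the final model choice.
model_comparison <- data.frame(
  Model_name = c("Lasso Regularization", "Ridge Regularization", "LDA",
                 "Naive Bayes", "Logistic Regression",
                 "Logistic regression with Backward Selection", "KNN",
                 "Logistic Regression Unbalanced", "QDA"),
  Accuracy = c(0.6855346, 0.6949686, 0.6949686, 0.5345912, 0.7138365,
               0.7044025, 0.7421384, 0.8459119, 0.6823899),
  Recall   = c(0.8163265, 0.7959184, 0.7959184, 0.7755102, 0.7551020,
               0.7551020, 0.6122449, 0.4285714, 0.4081633)
)
# Lasso Regularization comes out on top by recall
model_comparison[order(-model_comparison$Recall), c("Model_name", "Recall")]
```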


Misclassification Analysis

We proceeded to analyze the misclassified instances for the logistic regression model after applying backward selection. In general, with an appropriate threshold, our models achieved a good recall value, with few false negatives but a higher number of false positives. Observing the following plots, it is noticeable that the instances misclassified most often belong to categories with a higher probability of attrition, as seen in the EDA: employees who are single, work overtime, or travel frequently for work showed a stronger tendency to leave. Analyzing the misclassified instances, it is evident that employees in these categories who do not quit are harder to classify correctly, since they share many patterns with colleagues who do leave. In the following plots, we observe this behavior for the features MaritalStatus, OverTime, and BusinessTravel.

test.data <- test.data %>%
  mutate(Misclassified = ifelse(predicted.classes.over != Attrition, 1, 0),
         MisclassifiedFP = ifelse(predicted.classes.over == 1 & Attrition == 0, 1, 0),
         MisclassifiedFN = ifelse(predicted.classes.over == 0 & Attrition == 1, 1, 0))
marital_stats <- test.data %>%
  group_by(MaritalStatus) %>%
  summarise(CorrectPercentage = sum(1 - Misclassified) / n() * 100,
            MisclassifiedFPPercentage = sum(MisclassifiedFP) / n() * 100,
            MisclassifiedFNPercentage = sum(MisclassifiedFN) / n() * 100) %>%
  pivot_longer(cols = c("CorrectPercentage", "MisclassifiedFPPercentage", "MisclassifiedFNPercentage"),
               names_to = "PercentageType", values_to = "Percentage")


marital_stats$MaritalStatus <- factor(marital_stats$MaritalStatus, levels = c(1, 0, 2),
                                      labels = c("Single", "Married", "Divorced"))


ggplot(data = marital_stats, aes(x = MaritalStatus, y = Percentage, fill = PercentageType)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Correctly and Misclassified Instances by MaritalStatus",
       x = "MaritalStatus", y = "Percentage") +
  scale_fill_manual(values = c("CorrectPercentage" = "#7FB942", "MisclassifiedFPPercentage" = "#B9427F", "MisclassifiedFNPercentage" = "#427FB9")) +
  theme_minimal() +
  theme(legend.position = "top")

business_travel_stats <- test.data %>%
  group_by(BusinessTravel) %>%
  summarise(CorrectPercentage = sum(1 - Misclassified) / n() * 100,
            MisclassifiedFPPercentage = sum(MisclassifiedFP) / n() * 100,
            MisclassifiedFNPercentage = sum(MisclassifiedFN) / n() * 100) %>%
  pivot_longer(cols = c("CorrectPercentage", "MisclassifiedFPPercentage", "MisclassifiedFNPercentage"),
               names_to = "PercentageType", values_to = "Percentage")



business_travel_stats$BusinessTravel <- factor(business_travel_stats$BusinessTravel,
                                               levels = c(0, 1, 2),
                                               labels = c("Non-Travel", "Travel_Rarely", "Travel_Frequently"))


ggplot(data = business_travel_stats, aes(x = BusinessTravel, y = Percentage, fill = PercentageType)) +
  geom_bar(stat = "identity", position = "stack") +
  labs(title = "Percentage of Correctly and Misclassified Instances by BusinessTravel",
       x = "BusinessTravel", y = "Percentage") +
  scale_fill_manual(values = c("CorrectPercentage" = "#7FB942", "MisclassifiedFPPercentage" = "#B9427F", "MisclassifiedFNPercentage" = "#427FB9")) +
  theme_minimal() +
  theme(legend.position = "top", plot.title = element_text(size = 12))

Employees who both travel frequently for work and work overtime are more likely to be misclassified.


In our exploratory data analysis, we observed that employees in managerial and research director roles tend to have longer tenures with the company and receive higher salaries. Consequently, they are less likely to leave their positions and are also less prone to misclassification. In contrast, sales representatives and human resources staff exhibit a different pattern, with a higher likelihood of both attrition and misclassification.

We then visualized the distribution of monthly income, stratified into correctly classified and misclassified instances.
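A sketch of that visualization (assuming `test.data` carries the scaled MonthlyIncome column and the Misclassified flag computed earlier):

```r
library(ggplot2)

# Overlaid income densities for correctly and incorrectly classified employees
ggplot(test.data, aes(x = MonthlyIncome,
                      fill = factor(Misclassified,
                                    labels = c("Correct", "Misclassified")))) +
  geom_density(alpha = 0.5) +
  labs(title = "Monthly Income by Classification Outcome",
       x = "MonthlyIncome (scaled)", fill = "Outcome") +
  theme_minimal()
```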

Conclusions

We found that our models achieve quite similar metrics and show a tendency to overfit. Ultimately, we opted for the model with the highest recall, Lasso Regularization, as our goal is to identify employees who may be considering leaving the company. Lasso not only has the highest recall but also maintains a good balance with the other metrics, boasting the largest area under the precision-recall curve.

The actions the company takes to prevent attrition should not harm the employees whom our model misclassifies, so it is important to make these investments wisely. From the chosen model, it is evident that the most significant feature is OverTime. We recommend the company take this into consideration, potentially by adjusting overtime rates and establishing limits on weekly overtime hours.

Furthermore, it is crucial to gain a deeper understanding of the underlying factors contributing to this phenomenon. One way to do so is by conducting surveys on the workflow within the company, which could reveal significant factors, especially related to teamwork, that may be slowing down productivity. It is also essential to determine whether employees voluntarily opt to work overtime out of financial need, or whether overtime results from a stressful or unhealthy work environment.
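The claim about OverTime can be inspected directly from the fitted model; a sketch, assuming the Lasso fit from the earlier chapter is a `glmnet` object (`lasso.model` and `lambda.best` are illustrative names for the fit and the selected penalty):

```r
library(glmnet)

# Coefficients at the selected penalty; nonzero entries are the features
# Lasso retained.
coefs <- as.matrix(coef(lasso.model, s = lambda.best))
nonzero <- coefs[coefs[, 1] != 0, 1]

# Order by absolute magnitude; OverTime is expected at or near the top.
sort(abs(nonzero), decreasing = TRUE)
```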